blob: a2e5a52014552a29efe711bb7d061b50533b63de [file] [log] [blame] [view]
<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
# Initialization
<!-- adapted from diveintodeeplearning -->
In the [Neural Networks](./nn.ipynb) section we played fast and loose with setting
up our networks. In particular we did the following things that *shouldn't*
work:
* We defined the network architecture with no regard to the input
dimensionality.
* We added layers without regard to the output dimension of the previous layer.
* We even 'initialized' these parameters without knowing how many parameters
we were going to initialize.
All of those things sound impossible and indeed, they are. After all, there's
no way MXNet (or any other framework for that matter) could predict what the
input dimensionality of a network would be. Later on, when working with
convolutional networks and images this problem will become even more pertinent,
since the input dimensionality (i.e. the resolution of an image) will affect
the dimensionality of subsequent layers. The ability to
determine parameter dimensionality during run-time rather than at coding time
greatly simplifies the process of doing deep learning.
## Instantiating a Network
Let's see what happens when we instantiate a network. We start by defining a multi-layer perceptron.
```{.python .input}
from mxnet import init, np
from mxnet.gluon import nn
def getnet():
net = nn.Sequential()
net.add(nn.Dense(256, activation='relu'))
net.add(nn.Dense(10))
return net
net = getnet()
```
At this point the network doesn't really know yet what the dimensionalities of
the various parameters should be. All one could tell at this point is that each
layer needs weights and bias, albeit of unspecified dimensionality. If we try
accessing the parameters, that's exactly what happens.
```{.python .input}
print(net.collect_params())
```
You'll notice `None` here in each `Dense` layer. This absence of value is how
MXNet keeps track of unspecified dimensionality. In particular, trying to access
`net[0].weight.data()` at this point would trigger a runtime error stating that
the network needs initializing before it can do anything.
Note that if we did want to specify dimensionality, we could have done so by
using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`.
Let's see whether anything changes after we initialize the parameters:
```{.python .input}
net.initialize()
net.collect_params()
```
As we can see, nothing really changed. Only once we provide the network with
some data do we see a difference. Let's try it out.
```{.python .input}
x = np.random.uniform(size=(2, 20))
net(x) # Forward computation
print(net.collect_params())
```
We see all the dimensions have been determined and the parameters initialized.
This is because shape inference and parameter initialization have been
performed in a lazy manner, so they are performed only when needed. In the
above case, they are performed as a prerequisite to the forward computation.
Dimensional inference works like this: as soon as we knew the input
dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$ it was possible to define the
weight matrix for the first layer, i.e. $\mathbf{W}_1 \in \mathbb{R}^{256 \times
20}$. With that out of the way, we can progress to the second layer, define its
dimensionality to be $10 \times 256$ and so on through the computational graph
and resolve all the dimensions as they become available. Once this is known, we
can proceed by initializing parameters. This is the solution to the three
problems outlined above.
## Deferred Initialization in Practice
Now that we know how it works in theory, let's see when the initialization is
actually triggered. In order to do so, we mock up an initializer which does
nothing but report a debug message stating when it was invoked and with which
parameters.
```{.python .input n=22}
class MyInit(init.Initializer):
def _init_weight(self, name, data):
print('Init', name, data.shape)
# The actual initialization logic is omitted here.
net = getnet()
net.initialize(init=MyInit())
```
Note that, although `MyInit` will print information about the model parameters
when it is called, the above `initialize` function does not print any
information after it has been executed. Therefore there is no actual
initialization when calling the `initialize` function - this
+initialization is deferred until forward is called for the first time. Next,
we define the input and perform a forward calculation.
```{.python .input n=25}
x = np.random.uniform(size=(2, 20))
y = net(x)
```
At this time, information on the model parameters is printed. When performing a
forward calculation based on the input `x`, the system can automatically infer
the shape of the weight parameters of all layers based on the shape of the
input. Once the system has created these parameters, it calls the `MyInit`
instance to initialize them before proceeding to the forward calculation.
Of course, this initialization will only be called when completing the initial
forward calculation. After that, we will not re-initialize when we run the
forward calculation `net(x)`, so the output of the `MyInit` instance will not be
generated again.
```{.python .input}
y = net(x)
```
As mentioned at the beginning of this section, deferred initialization can also
cause confusion. Before the first forward calculation, we were unable to
directly manipulate the model parameters, for example, we could not use the
`data` and `set_data` functions to get and modify the parameters. Therefore, we
often force initialization by sending a sample observation through the network.
## Forced Initialization
Deferred initialization does not occur if the system knows the shape of all
parameters when calling the `initialize` function. This can occur in two cases:
* We've already seen some data and we just want to reset the parameters.
* We specified all input and output dimensions of the network or layer when
defining it.
The first case works just fine, as illustrated below.
```{.python .input}
net.initialize(init=MyInit(), force_reinit=True)
```
The second case requires us to specify the remaining set of parameters when
creating the layer. For instance, for dense layers we also need to specify the
`in_units` so that initialization can occur immediately once `initialize` is
called.
```{.python .input}
net = nn.Sequential()
net.add(nn.Dense(256, in_units=20, activation='relu'))
net.add(nn.Dense(10, in_units=256))
net.initialize(init=MyInit())
```
## Parameter Initialization
By default, MXNet initializes the weight matrices uniformly by drawing random
values with uniform-distribution between $-0.07$ and $0.07$ ($U[-0.07, 0.07]$)
and updates the bias parameters by setting them all to $0$. However, we often
need to use other methods to initialize the weights. MXNet's `init` module
provides a variety of preset initialization methods, but if we want something
out of the ordinary, we need a bit of extra work.
### Built-in Initialization
Let's begin with the built-in initializers. The code below initializes all
parameters with Gaussian random variables.
```{.python .input n=9}
# force_reinit ensures that the variables are initialized again, regardless of
# whether they were already initialized previously.
net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
print(net[0].weight.data()[0])
```
If we wanted to initialize all parameters to $1$, we could do this simply by
changing the initializer to `Constant(1)`.
```{.python .input n=10}
net.initialize(init=init.Constant(1), force_reinit=True)
net[0].weight.data()[0]
```
If we want to initialize only a specific parameter in a different manner, we
can simply set the initializer only for the appropriate subblock (or
parameter). For instance, below we initialize the second layer to a constant
value of $42$ and we use the `Xavier` initializer for the weights of the
first layer.
```{.python .input n=11}
net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
net[1].initialize(init=init.Constant(42), force_reinit=True)
# First layer
print(net[0].weight.data()[0])
print(net[0].bias.data()[0]) # initialized to 0
# Second layer
print(net[1].weight.data()[0,0])
print(net[1].bias.data()[0]) # initialized to 0
```
### Custom Initialization
Sometimes, the initialization methods we need are not provided in the `init`
module. At this point, we can implement a subclass of the `Initializer` class
so that we can use it like any other initialization method. Usually, we only
need to implement the `_init_weight` function to suit our needs. In the example
below, we pick a decidedly bizarre and nontrivial distribution, just to prove
the point. We draw the coefficients from the following distribution:
$$
\begin{aligned}
w \sim \begin{cases}
U[5, 10] & \text{ with probability } \frac{1}{4} \\
0 & \text{ with probability } \frac{1}{2} \\
U[-10, -5] & \text{ with probability } \frac{1}{4}
\end{cases}
\end{aligned}
$$
```{.python .input n=12}
class MyInit(init.Initializer):
def _init_weight(self, name, data):
print('Init', name, data.shape)
data[:] = np.random.uniform(low=-10, high=10, size=data.shape)
data *= np.abs(data) >= 5
net.initialize(MyInit(), force_reinit=True)
net[0].weight.data()[0]
```
If this functionality is insufficient, we can even set parameters directly.
Since `data()` returns an `NDArray` we can access it just like any other matrix.
A note for advanced users - if you want to adjust parameters within an
`autograd` scope you need to use `set_data` to avoid confusing the automatic
differentiation mechanics.
```{.python .input n=13}
net[0].weight.data()[:] += 1
net[0].weight.data()[0,0] = 42
net[0].weight.data()[0]
```
## Tied Parameters
In some cases, we want to share model parameters across multiple layers. For
instance when we want to find good word embeddings we may decide to use the
same parameters both for encoding and decoding of words. Let's see how to do
this a bit more elegantly. In the following we construct a dense layer and then
use its parameters specifically to set those of another layer.
```{.python .input n=14}
net = nn.Sequential()
# We need to give the shared layer a name such that we can reference its
# parameters.
shared = nn.Dense(8, activation='relu')
net.add(nn.Dense(8, activation='relu'),
shared,
nn.Dense(8, activation='relu').share_parameters(shared.params),
nn.Dense(10))
net.initialize()
x = np.random.uniform(size=(2, 20))
net(x)
# Check whether the parameters are the same.
print(net[1].weight.data()[0] == net[2].weight.data()[0])
net[1].weight.data()[0,0] = 100
# And make sure that they're actually the same object rather than just having
# the same value.
print(net[1].weight.data()[0] == net[2].weight.data()[0])
```
The above example shows that the parameters of the second and third layer are
tied. As Python objects, they are identical rather than just being equal.
That is, by changing one of the parameters the other one changes too. What
happens to the gradients is quite ingenious. Since the model parameters contain
gradients, the gradients of the second hidden layer and the third hidden layer
are accumulated in `shared.params.grad` during backpropagation.
## Conclusion
In this tutorial you learnt how to initialize a neural network, and should now
understand the difference between deferred and forced initialization. Some more advanced
cases you should now be aware of include custom initialization and tied parameters.
## Recommended Next Steps
* Check out the [API Docs](../../../../api/optimizer/index.rst) on initialization for a list of available initialization methods.
* See [this tutorial](./naming.ipynb) for more information on Gluon Parameters.