docs/python_docs/python/tutorials/packages/gluon/blocks/init.md - mxnet - Git at Google

 <!--- Licensed to the Apache Software Foundation (ASF) under one -->
 <!--- or more contributor license agreements.  See the NOTICE file -->
 <!--- distributed with this work for additional information -->
 <!--- regarding copyright ownership.  The ASF licenses this file -->
 <!--- to you under the Apache License, Version 2.0 (the -->
 <!--- "License"); you may not use this file except in compliance -->
 <!--- with the License.  You may obtain a copy of the License at -->

 <!---   http://www.apache.org/licenses/LICENSE-2.0 -->

 <!--- Unless required by applicable law or agreed to in writing, -->
 <!--- software distributed under the License is distributed on an -->
 <!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
 <!--- KIND, either express or implied.  See the License for the -->
 <!--- specific language governing permissions and limitations -->
 <!--- under the License. -->

 # Initialization

 <!-- adapted from diveintodeeplearning -->

 In the [Neural Networks](./nn.ipynb) section we played fast and loose with setting
 up our networks. In particular we did the following things that *shouldn't*
 work:

 * We defined the network architecture with no regard to the input
   dimensionality.
 * We added layers without regard to the output dimension of the previous layer.
 * We even 'initialized' these parameters without knowing how many parameters
   we were going to initialize.

 All of those things sound impossible and indeed, they are. After all, there's
 no way MXNet (or any other framework for that matter) could predict what the
 input dimensionality of a network would be. Later on, when working with
 convolutional networks and images this problem will become even more pertinent,
 since the input dimensionality (i.e. the resolution of an image) will affect
 the dimensionality of subsequent layers. The ability to
 determine parameter dimensionality during run-time rather than at coding time
 greatly simplifies the process of doing deep learning.

 ## Instantiating a Network

 Let's see what happens when we instantiate a network. We start by defining a multi-layer perceptron.

 ```{.python .input}
 from mxnet import init, np
 from mxnet.gluon import nn


 def getnet():
     net = nn.Sequential()
     net.add(nn.Dense(256, activation='relu'))
     net.add(nn.Dense(10))
     return net

 net = getnet()
 ```

 At this point the network doesn't really know yet what the dimensionalities of
 the various parameters should be. All one could tell at this point is that each
 layer needs weights and bias, albeit of unspecified dimensionality. If we try
 accessing the parameters, that's exactly what happens.

 ```{.python .input}
 print(net.collect_params())
 ```

 You'll notice `None` here in each `Dense` layer. This absence of value is how
 MXNet keeps track of unspecified dimensionality. In particular, trying to access
 `net[0].weight.data()` at this point would trigger a runtime error stating that
 the network needs initializing before it can do anything.

 Note that if we did want to specify dimensionality, we could have done so by
 using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`.

 Let's see whether anything changes after we initialize the parameters:


 ```{.python .input}
 net.initialize()
 net.collect_params()
 ```

 As we can see, nothing really changed. Only once we provide the network with
 some data do we see a difference. Let's try it out.

 ```{.python .input}
 x = np.random.uniform(size=(2, 20))
 net(x)  # Forward computation
 print(net.collect_params())
 ```

 We see all the dimensions have been determined and the parameters initialized.
 This is because shape inference and parameter initialization have been
 performed in a lazy manner, so they are performed only when needed. In the
 above case, they are performed as a prerequisite to the forward computation.

 Dimensional inference works like this: as soon as we knew the input
 dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$ it was possible to define the
 weight matrix for the first layer, i.e. $\mathbf{W}_1 \in \mathbb{R}^{256 \times
 20}$. With that out of the way, we can progress to the second layer, define its
 dimensionality to be $10 \times 256$ and so on through the computational graph
 and resolve all the dimensions as they become available. Once this is known, we
 can proceed by initializing parameters. This is the solution to the three
 problems outlined above.


 ## Deferred Initialization in Practice

 Now that we know how it works in theory, let's see when the initialization is
 actually triggered. In order to do so, we mock up an initializer which does
 nothing but report a debug message stating when it was invoked and with which
 parameters.

 ```{.python .input  n=22}
 class MyInit(init.Initializer):
     def _init_weight(self, name, data):
         print('Init', name, data.shape)
         # The actual initialization logic is omitted here.

 net = getnet()
 net.initialize(init=MyInit())
 ```

 Note that, although `MyInit` will print information about the model parameters
 when it is called, the above `initialize` function does not print any
 information after it has been executed.  Therefore there is no actual
 initialization when calling the `initialize` function - this
 +initialization is deferred until forward is called for the first time. Next,
 we define the input and perform a forward calculation.

 ```{.python .input  n=25}
 x = np.random.uniform(size=(2, 20))
 y = net(x)
 ```

 At this time, information on the model parameters is printed. When performing a
 forward calculation based on the input `x`, the system can automatically infer
 the shape of the weight parameters of all layers based on the shape of the
 input. Once the system has created these parameters, it calls the `MyInit`
 instance to initialize them before proceeding to the forward calculation.

 Of course, this initialization will only be called when completing the initial
 forward calculation. After that, we will not re-initialize when we run the
 forward calculation `net(x)`, so the output of the `MyInit` instance will not be
 generated again.

 ```{.python .input}
 y = net(x)
 ```

 As mentioned at the beginning of this section, deferred initialization can also
 cause confusion. Before the first forward calculation, we were unable to
 directly manipulate the model parameters, for example, we could not use the
 `data` and `set_data` functions to get and modify the parameters. Therefore, we
 often force initialization by sending a sample observation through the network.

 ## Forced Initialization

 Deferred initialization does not occur if the system knows the shape of all
 parameters when calling the `initialize` function. This can occur in two cases:

 * We've already seen some data and we just want to reset the parameters.
 * We specified all input and output dimensions of the network or layer when
   defining it.

 The first case works just fine, as illustrated below.

 ```{.python .input}
 net.initialize(init=MyInit(), force_reinit=True)
 ```

 The second case requires us to specify the remaining set of parameters when
 creating the layer. For instance, for dense layers we also need to specify the
 `in_units` so that initialization can occur immediately once `initialize` is
 called.

 ```{.python .input}
 net = nn.Sequential()
 net.add(nn.Dense(256, in_units=20, activation='relu'))
 net.add(nn.Dense(10, in_units=256))

 net.initialize(init=MyInit())
 ```

 ## Parameter Initialization

 By default, MXNet initializes the weight matrices uniformly by drawing random
 values with uniform-distribution between $-0.07$ and $0.07$ ($U[-0.07, 0.07]$)
 and updates the bias parameters by setting them all to $0$.  However, we often
 need to use other methods to initialize the weights.  MXNet's `init` module
 provides a variety of preset initialization methods, but if we want something
 out of the ordinary, we need a bit of extra work.

 ### Built-in Initialization

 Let's begin with the built-in initializers. The code below initializes all
 parameters with Gaussian random variables.

 ```{.python .input  n=9}
 # force_reinit ensures that the variables are initialized again, regardless of
 # whether they were already initialized previously.
 net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
 print(net[0].weight.data()[0])
 ```

 If we wanted to initialize all parameters to $1$, we could do this simply by
 changing the initializer to `Constant(1)`.

 ```{.python .input  n=10}
 net.initialize(init=init.Constant(1), force_reinit=True)
 net[0].weight.data()[0]
 ```

 If we want to initialize only a specific parameter in a different manner, we
 can simply set the initializer only for the appropriate subblock (or
 parameter). For instance, below we initialize the second layer to a constant
 value of $42$ and we use the `Xavier` initializer for the weights of the
 first layer.

 ```{.python .input  n=11}
 net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
 net[1].initialize(init=init.Constant(42), force_reinit=True)

 # First layer
 print(net[0].weight.data()[0])
 print(net[0].bias.data()[0])  # initialized to 0

 # Second layer
 print(net[1].weight.data()[0,0])
 print(net[1].bias.data()[0])  # initialized to 0
 ```

 ### Custom Initialization

 Sometimes, the initialization methods we need are not provided in the `init`
 module. At this point, we can implement a subclass of the `Initializer` class
 so that we can use it like any other initialization method. Usually, we only
 need to implement the `_init_weight` function to suit our needs. In the example
 below, we pick a decidedly bizarre and nontrivial distribution, just to prove
 the point. We draw the coefficients from the following distribution:

 $$
 \begin{aligned}
     w \sim \begin{cases}
         U[5, 10] & \text{ with probability } \frac{1}{4} \\
             0    & \text{ with probability } \frac{1}{2} \\
         U[-10, -5] & \text{ with probability } \frac{1}{4}
     \end{cases}
 \end{aligned}
 $$

 ```{.python .input  n=12}
 class MyInit(init.Initializer):
     def _init_weight(self, name, data):
         print('Init', name, data.shape)
         data[:] = np.random.uniform(low=-10, high=10, size=data.shape)
         data *= np.abs(data) >= 5

 net.initialize(MyInit(), force_reinit=True)
 net[0].weight.data()[0]
 ```

 If this functionality is insufficient, we can even set parameters directly.
 Since `data()` returns an `NDArray` we can access it just like any other matrix.
 A note for advanced users - if you want to adjust parameters within an
 `autograd` scope you need to use `set_data` to avoid confusing the automatic
 differentiation mechanics.

 ```{.python .input  n=13}
 net[0].weight.data()[:] += 1
 net[0].weight.data()[0,0] = 42
 net[0].weight.data()[0]
 ```

 ## Tied Parameters

 In some cases, we want to share model parameters across multiple layers. For
 instance when we want to find good word embeddings we may decide to use the
 same parameters both for encoding and decoding of words. Let's see how to do
 this a bit more elegantly. In the following we construct a dense layer and then
 use its parameters specifically to set those of another layer.

 ```{.python .input  n=14}
 net = nn.Sequential()
 # We need to give the shared layer a name such that we can reference its
 # parameters.
 shared = nn.Dense(8, activation='relu')
 net.add(nn.Dense(8, activation='relu'),
         shared,
         nn.Dense(8, activation='relu').share_parameters(shared.params),
         nn.Dense(10))
 net.initialize()

 x = np.random.uniform(size=(2, 20))
 net(x)

 # Check whether the parameters are the same.
 print(net[1].weight.data()[0] == net[2].weight.data()[0])
 net[1].weight.data()[0,0] = 100
 # And make sure that they're actually the same object rather than just having
 # the same value.
 print(net[1].weight.data()[0] == net[2].weight.data()[0])
 ```

 The above example shows that the parameters of the second and third layer are
 tied. As Python objects, they are identical rather than just being equal.
 That is, by changing one of the parameters the other one changes too. What
 happens to the gradients is quite ingenious. Since the model parameters contain
 gradients, the gradients of the second hidden layer and the third hidden layer
 are accumulated in `shared.params.grad` during backpropagation.

 ## Conclusion

 In this tutorial you learnt how to initialize a neural network, and should now
 understand the difference between deferred and forced initialization. Some more advanced
 cases you should now be aware of include custom initialization and tied parameters.

 ## Recommended Next Steps

 * Check out the [API Docs](../../../../api/optimizer/index.rst) on initialization for a list of available initialization methods.
 * See [this tutorial](./naming.ipynb) for more information on Gluon Parameters.
	<!--- Licensed to the Apache Software Foundation (ASF) under one -->
	<!--- or more contributor license agreements. See the NOTICE file -->
	<!--- distributed with this work for additional information -->
	<!--- regarding copyright ownership. The ASF licenses this file -->
	<!--- to you under the Apache License, Version 2.0 (the -->
	<!--- "License"); you may not use this file except in compliance -->
	<!--- with the License. You may obtain a copy of the License at -->

	<!--- http://www.apache.org/licenses/LICENSE-2.0 -->

	<!--- Unless required by applicable law or agreed to in writing, -->
	<!--- software distributed under the License is distributed on an -->
	<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
	<!--- KIND, either express or implied. See the License for the -->
	<!--- specific language governing permissions and limitations -->
	<!--- under the License. -->

	# Initialization

	<!-- adapted from diveintodeeplearning -->

	In the [Neural Networks](./nn.ipynb) section we played fast and loose with setting
	up our networks. In particular we did the following things that shouldn't
	work:

	* We defined the network architecture with no regard to the input
	dimensionality.
	* We added layers without regard to the output dimension of the previous layer.
	* We even 'initialized' these parameters without knowing how many parameters
	we were going to initialize.

	All of those things sound impossible and indeed, they are. After all, there's
	no way MXNet (or any other framework for that matter) could predict what the
	input dimensionality of a network would be. Later on, when working with
	convolutional networks and images this problem will become even more pertinent,
	since the input dimensionality (i.e. the resolution of an image) will affect
	the dimensionality of subsequent layers. The ability to
	determine parameter dimensionality during run-time rather than at coding time
	greatly simplifies the process of doing deep learning.

	## Instantiating a Network

	Let's see what happens when we instantiate a network. We start by defining a multi-layer perceptron.

	```{.python .input}
	from mxnet import init, np
	from mxnet.gluon import nn


	def getnet():
	net = nn.Sequential()
	net.add(nn.Dense(256, activation='relu'))
	net.add(nn.Dense(10))
	return net

	net = getnet()
	```

	At this point the network doesn't really know yet what the dimensionalities of
	the various parameters should be. All one could tell at this point is that each
	layer needs weights and bias, albeit of unspecified dimensionality. If we try
	accessing the parameters, that's exactly what happens.

	```{.python .input}
	print(net.collect_params())
	```

	You'll notice `None` here in each `Dense` layer. This absence of value is how
	MXNet keeps track of unspecified dimensionality. In particular, trying to access
	`net[0].weight.data()` at this point would trigger a runtime error stating that
	the network needs initializing before it can do anything.

	Note that if we did want to specify dimensionality, we could have done so by
	using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`.

	Let's see whether anything changes after we initialize the parameters:


	```{.python .input}
	net.initialize()
	net.collect_params()
	```

	As we can see, nothing really changed. Only once we provide the network with
	some data do we see a difference. Let's try it out.

	```{.python .input}
	x = np.random.uniform(size=(2, 20))
	net(x) # Forward computation
	print(net.collect_params())
	```

	We see all the dimensions have been determined and the parameters initialized.
	This is because shape inference and parameter initialization have been
	performed in a lazy manner, so they are performed only when needed. In the
	above case, they are performed as a prerequisite to the forward computation.

	Dimensional inference works like this: as soon as we knew the input
	dimensionality, $\mathbf{x} \in \mathbb{R}^{20}$ it was possible to define the
	weight matrix for the first layer, i.e. $\mathbf{W}_1 \in \mathbb{R}^{256 \times
	20}$. With that out of the way, we can progress to the second layer, define its
	dimensionality to be $10 \times 256$ and so on through the computational graph
	and resolve all the dimensions as they become available. Once this is known, we
	can proceed by initializing parameters. This is the solution to the three
	problems outlined above.


	## Deferred Initialization in Practice

	Now that we know how it works in theory, let's see when the initialization is
	actually triggered. In order to do so, we mock up an initializer which does
	nothing but report a debug message stating when it was invoked and with which
	parameters.

	```{.python .input n=22}
	class MyInit(init.Initializer):
	def _init_weight(self, name, data):
	print('Init', name, data.shape)
	# The actual initialization logic is omitted here.

	net = getnet()
	net.initialize(init=MyInit())
	```

	Note that, although `MyInit` will print information about the model parameters
	when it is called, the above `initialize` function does not print any
	information after it has been executed. Therefore there is no actual
	initialization when calling the `initialize` function - this
	+initialization is deferred until forward is called for the first time. Next,
	we define the input and perform a forward calculation.

	```{.python .input n=25}
	x = np.random.uniform(size=(2, 20))
	y = net(x)
	```

	At this time, information on the model parameters is printed. When performing a
	forward calculation based on the input `x`, the system can automatically infer
	the shape of the weight parameters of all layers based on the shape of the
	input. Once the system has created these parameters, it calls the `MyInit`
	instance to initialize them before proceeding to the forward calculation.

	Of course, this initialization will only be called when completing the initial
	forward calculation. After that, we will not re-initialize when we run the
	forward calculation `net(x)`, so the output of the `MyInit` instance will not be
	generated again.

	```{.python .input}
	y = net(x)
	```

	As mentioned at the beginning of this section, deferred initialization can also
	cause confusion. Before the first forward calculation, we were unable to
	directly manipulate the model parameters, for example, we could not use the
	`data` and `set_data` functions to get and modify the parameters. Therefore, we
	often force initialization by sending a sample observation through the network.

	## Forced Initialization

	Deferred initialization does not occur if the system knows the shape of all
	parameters when calling the `initialize` function. This can occur in two cases:

	* We've already seen some data and we just want to reset the parameters.
	* We specified all input and output dimensions of the network or layer when
	defining it.

	The first case works just fine, as illustrated below.

	```{.python .input}
	net.initialize(init=MyInit(), force_reinit=True)
	```

	The second case requires us to specify the remaining set of parameters when
	creating the layer. For instance, for dense layers we also need to specify the
	`in_units` so that initialization can occur immediately once `initialize` is
	called.

	```{.python .input}
	net = nn.Sequential()
	net.add(nn.Dense(256, in_units=20, activation='relu'))
	net.add(nn.Dense(10, in_units=256))

	net.initialize(init=MyInit())
	```

	## Parameter Initialization

	By default, MXNet initializes the weight matrices uniformly by drawing random
	values with uniform-distribution between $-0.07$ and $0.07$ ($U[-0.07, 0.07]$)
	and updates the bias parameters by setting them all to $0$. However, we often
	need to use other methods to initialize the weights. MXNet's `init` module
	provides a variety of preset initialization methods, but if we want something
	out of the ordinary, we need a bit of extra work.

	### Built-in Initialization

	Let's begin with the built-in initializers. The code below initializes all
	parameters with Gaussian random variables.

	```{.python .input n=9}
	# force_reinit ensures that the variables are initialized again, regardless of
	# whether they were already initialized previously.
	net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)
	print(net[0].weight.data()[0])
	```

	If we wanted to initialize all parameters to $1$, we could do this simply by
	changing the initializer to `Constant(1)`.

	```{.python .input n=10}
	net.initialize(init=init.Constant(1), force_reinit=True)
	net[0].weight.data()[0]
	```

	If we want to initialize only a specific parameter in a different manner, we
	can simply set the initializer only for the appropriate subblock (or
	parameter). For instance, below we initialize the second layer to a constant
	value of $42$ and we use the `Xavier` initializer for the weights of the
	first layer.

	```{.python .input n=11}
	net[0].weight.initialize(init=init.Xavier(), force_reinit=True)
	net[1].initialize(init=init.Constant(42), force_reinit=True)

	# First layer
	print(net[0].weight.data()[0])
	print(net[0].bias.data()[0]) # initialized to 0

	# Second layer
	print(net[1].weight.data()[0,0])
	print(net[1].bias.data()[0]) # initialized to 0
	```

	### Custom Initialization

	Sometimes, the initialization methods we need are not provided in the `init`
	module. At this point, we can implement a subclass of the `Initializer` class
	so that we can use it like any other initialization method. Usually, we only
	need to implement the `_init_weight` function to suit our needs. In the example
	below, we pick a decidedly bizarre and nontrivial distribution, just to prove
	the point. We draw the coefficients from the following distribution:

	$$
	\begin{aligned}
	w \sim \begin{cases}
	U[5, 10] & \text{ with probability } \frac{1}{4} \\
	0 & \text{ with probability } \frac{1}{2} \\
	U[-10, -5] & \text{ with probability } \frac{1}{4}
	\end{cases}
	\end{aligned}
	$$

	```{.python .input n=12}
	class MyInit(init.Initializer):
	def _init_weight(self, name, data):
	print('Init', name, data.shape)
	data[:] = np.random.uniform(low=-10, high=10, size=data.shape)
	data *= np.abs(data) >= 5

	net.initialize(MyInit(), force_reinit=True)
	net[0].weight.data()[0]
	```

	If this functionality is insufficient, we can even set parameters directly.
	Since `data()` returns an `NDArray` we can access it just like any other matrix.
	A note for advanced users - if you want to adjust parameters within an
	`autograd` scope you need to use `set_data` to avoid confusing the automatic
	differentiation mechanics.

	```{.python .input n=13}
	net[0].weight.data()[:] += 1
	net[0].weight.data()[0,0] = 42
	net[0].weight.data()[0]
	```

	## Tied Parameters

	In some cases, we want to share model parameters across multiple layers. For
	instance when we want to find good word embeddings we may decide to use the
	same parameters both for encoding and decoding of words. Let's see how to do
	this a bit more elegantly. In the following we construct a dense layer and then
	use its parameters specifically to set those of another layer.

	```{.python .input n=14}
	net = nn.Sequential()
	# We need to give the shared layer a name such that we can reference its
	# parameters.
	shared = nn.Dense(8, activation='relu')
	net.add(nn.Dense(8, activation='relu'),
	shared,
	nn.Dense(8, activation='relu').share_parameters(shared.params),
	nn.Dense(10))
	net.initialize()

	x = np.random.uniform(size=(2, 20))
	net(x)

	# Check whether the parameters are the same.
	print(net[1].weight.data()[0] == net[2].weight.data()[0])
	net[1].weight.data()[0,0] = 100
	# And make sure that they're actually the same object rather than just having
	# the same value.
	print(net[1].weight.data()[0] == net[2].weight.data()[0])
	```

	The above example shows that the parameters of the second and third layer are
	tied. As Python objects, they are identical rather than just being equal.
	That is, by changing one of the parameters the other one changes too. What
	happens to the gradients is quite ingenious. Since the model parameters contain
	gradients, the gradients of the second hidden layer and the third hidden layer
	are accumulated in `shared.params.grad` during backpropagation.

	## Conclusion

	In this tutorial you learnt how to initialize a neural network, and should now
	understand the difference between deferred and forced initialization. Some more advanced
	cases you should now be aware of include custom initialization and tied parameters.

	## Recommended Next Steps

	* Check out the [API Docs](../../../../api/optimizer/index.rst) on initialization for a list of available initialization methods.
	* See [this tutorial](./naming.ipynb) for more information on Gluon Parameters.