docs/python_docs/python/tutorials/packages/gluon/blocks/nn.md - mxnet - Git at Google

 <!--- Licensed to the Apache Software Foundation (ASF) under one -->
 <!--- or more contributor license agreements.  See the NOTICE file -->
 <!--- distributed with this work for additional information -->
 <!--- regarding copyright ownership.  The ASF licenses this file -->
 <!--- to you under the Apache License, Version 2.0 (the -->
 <!--- "License"); you may not use this file except in compliance -->
 <!--- with the License.  You may obtain a copy of the License at -->

 <!---   http://www.apache.org/licenses/LICENSE-2.0 -->

 <!--- Unless required by applicable law or agreed to in writing, -->
 <!--- software distributed under the License is distributed on an -->
 <!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
 <!--- KIND, either express or implied.  See the License for the -->
 <!--- specific language governing permissions and limitations -->
 <!--- under the License. -->

 # Layers and Blocks

 <!-- adapted from diveintodeeplearning -->

 As network complexity increases, we move from designing single to entire layers
 of neurons.

 Neural network designs like
 [ResNet-152](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)
 have a fair degree of regularity. They consist of *blocks* of repeated (or at
 least similarly designed) layers; these blocks then form the basis of more
 complex network designs.

 In this section, we'll talk about how to write code that makes such blocks on
 demand, just like a Lego factory generates blocks which can be combined to
 produce terrific artifacts.

 We start with a very simple block, namely the block for a multilayer
 perceptron. A common strategy would be to construct a two-layer network as
 follows:

 ```{.python .input  n=1}
 import mxnet as mx
 from mxnet import np, npx
 from mxnet.gluon import nn, Block, Parameter, Constant


 x = np.random.uniform(size=(2, 20))

 net = nn.Sequential()
 net.add(nn.Dense(256, activation='relu'))
 net.add(nn.Dense(10))
 net.initialize()
 net(x)
 ```

 This generates a network with a hidden layer of $256$ units, followed by a ReLU
 activation and another $10$ units governing the output. In particular, we used
 the [nn.Sequential](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Sequential)
 constructor to generate an empty network into which we then inserted both
 layers. What exactly happens inside `nn.Sequential`
 has remained rather mysterious so far. In the following we will see that this
 really just constructs a block that is a container for other blocks. These
 blocks can be combined into larger artifacts, often recursively. The diagram
 below shows how:

 ![Blocks can be used recursively to form larger artifacts](/_static/blocks.svg)

 In the following we will explain the various steps needed to go from defining
 layers to defining blocks (of one or more layers):

 1. Blocks take data as input.
 1. Blocks store state in the form of parameters that are inherent to the block.
    For instance, the block above contains two hidden layers, and we need a
    place to store parameters for it.
 1. Blocks produce meaningful output. This is typically encoded in what
    we will call the `forward` function. It allows us to invoke a block via
    `net(X)` to obtain the desired output. What happens behind the scenes is
    that it invokes `forward` to perform forward propagation (also called
    forward computation).
 1. Blocks initialize the parameters in a lazy fashion as part of the first
    `forward` call.
 1. Blocks calculate a gradient with regard to their input when invoking
    `backward`. Typically this is automatic.

 ## A Sequential Block

 The [Block](../../../../api/gluon/block.rst#mxnet.gluon.Block) class is a
 generic component describing data flow. When the data flows through a sequence
 of blocks, each block applied to the output of the one before with the first
 block being applied on the input data itself, we have a special kind of block,
 namely the `Sequential` block.

 `Sequential` has helper methods to manage the sequence, with `add` being the
 main one of interest allowing you to append blocks in sequence. Once the
 operations have been added, the forward computation of the model applies the
 blocks on the input data in the order they were added.  Below, we implement a
 `MySequential` class that has the same functionality as the `Sequential` class.
 This may help you understand more clearly how the `Sequential` class works.

 ```{.python .input  n=3}
 class MySequential(Block):
     def __init__(self):
         super(MySequential, self).__init__()
         self._layers = []

     def add(self, block):
         # Here, block is an instance of a Block subclass, and we assume it has a unique name. We save it in the
         # member variable _layers of the Block class, and its type is List. When the MySequential instance
         # calls the initialize function, the system automatically initializes all members of _layers.
         self._layers.append(block)
         self.register_child(block)

     def forward(self, x):
         # OrderedDict guarantees that members will be traversed in the order they were added.
         for block in self._children.values():
             x = block()(x)
         return x
 ```

 At its core is the `add` method. It adds any block to the ordered dictionary of
 children. These are then executed in sequence when forward propagation is
 invoked. Let's see what the MLP looks like now.

 ```{.python .input  n=4}
 net = MySequential()
 net.add(nn.Dense(256, activation='relu'))
 net.add(nn.Dense(10))
 net.initialize()
 net(x)
 ```

 Indeed, it is no different than It can observed here that the use of the
 `MySequential` class is no different from the use of the Sequential class.


 ## A Custom Block

 It is easy to go beyond simple concatenation with `Sequential`. The
 `Block` class provides the functionality required to make such customizations.
 `Block` has a model constructor provided in the `nn` module, which we can
 inherit to define the model we want. The following inherits the `Block` class to
 construct the multilayer perceptron mentioned at the beginning of this section.
 The `MLP` class defined here overrides the `__init__` and `forward` functions
 of the Block class. They are used to create model parameters and define forward
 computations, respectively. Forward computation is also forward propagation.

 ```{.python .input  n=1}
 class MLP(nn.Block):
     # Declare a layer with model parameters. Here, we declare two fully
     # connected layers.

     def __init__(self, **kwargs):
         # Call the constructor of the MLP parent class Block to perform the
         # necessary initialization. In this way, other function parameters can
         # also be specified when constructing an instance, such as the model
         # parameter, params, described in the following sections.
         super(MLP, self).__init__(**kwargs)
         self.hidden = nn.Dense(256, activation='relu')  # Hidden layer
         self.output = nn.Dense(10)  # Output layer

     # Define the forward computation of the model, that is, how to return the
     # required model output based on the input x.

     def forward(self, x):
         hidden_out = self.hidden(x)
         return self.output(hidden_out)
 ```

 Let's look at it a bit more closely. The `forward` method invokes a network
 simply by evaluating the hidden layer `self.hidden(x)` and subsequently by
 evaluating the output layer `self.output( ... )`. This is what we expect in the
 forward pass of this block.

 In order for the block to know what it needs to evaluate, we first need to
 define the layers. This is what the `__init__` method does. It first
 initializes all of the Block-related parameters and then constructs the
 requisite layers. This attaches the coresponding layers and the required
 parameters to the class. Note that there is no need to define a backpropagation
 method in the class. The system automatically generates the `backward` method
 needed for back propagation by automatically finding the gradient (see the tutorial
 on [autograd](../../autograd/index.ipynb)). The same applies to the
 [initialize](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Block.initialize)
 method, which is generated automatically. Let's try
 this out:

 ```{.python .input  n=2}
 net = MLP()
 net.initialize()
 net(x)
 ```

 As explained above, the `Block` class can be quite versatile in terms of what it
 does. For instance, its subclass can be a layer (such as the `Dense` class
 provided by Gluon), it can be a model (such as the `MLP` class we just derived),
 or it can be a part of a model (this is what typically happens when designing
 very deep networks). Throughout this chapter we will see how to use this with
 great flexibility.


 ## Coding with `Blocks`

 ### Blocks
 The [Sequential](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Sequential) class
 can make model construction easier and does not require you to define the
 `forward` method; however, directly inheriting from
 its parent class, [Block](../../../../api/gluon/block.rst#mxnet.gluon.Block), can greatly
 expand the flexibility of model construction. For example, implementing the
 `forward` method means you can introduce control flow in the network.

 ### Constant parameters
 Now we'd like to introduce the notation of a *constant* parameter. These are
 parameters that are not used when invoking backpropagation. This sounds very
 abstract but here's what's really going on.
 Assume that we have some function

 $$f(\mathbf{x},\mathbf{w}) = 3 \cdot \mathbf{w}^\top \mathbf{x}.$$

 In this case $3$ is a constant parameter. We could change $3$ to something else,
 say $c$ via

 $$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}.$$

 Nothing has really changed, except that we can adjust the value of $c$. It is
 still a constant as far as $\mathbf{w}$ and $\mathbf{x}$ are concerned. However,
 Gluon doesn't know about this unless we create it with `get_constant`
 (this makes the code go faster, too, since we're not sending the Gluon engine
 on a wild goose chase after a parameter that doesn't change).

 ```{.python .input  n=5}
 class FancyMLP(nn.Block):
     def __init__(self, **kwargs):
         super(FancyMLP, self).__init__(**kwargs)

         # Random weight parameters created with the get_constant are not
         # iterated during training (i.e. constant parameters).
         self.rand_weight = Constant(np.random.uniform(size=(20, 20)))
         self.dense = nn.Dense(20, activation='relu')

     def forward(self, x):
         x = self.dense(x)
         # Use the constant parameters created, as well as the ReLU and dot
         # functions of NDArray.

         x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
         # Re-use the fully connected layer. This is equivalent to sharing
         # parameters with two fully connected layers.
         x = self.dense(x)
         # Here in the control flow, we need to call `item` to return the
         # scalar for comparison.

         while npx.norm(x).item() > 1:
             x /= 2
         if npx.norm(x).item() < 0.8:
             x *= 10
         return x.sum()
 ```

 In this `FancyMLP` model, we used constant weight `rand_weight` (note that it is
 not a model parameter), performed a matrix multiplication operation (`nd.dot`),
 and reused the *same* `Dense` layer. Note that this is very different from using
 two dense layers with different sets of parameters. Instead, we used the same
 network twice. Quite often in deep networks one also says that the parameters
 are *tied* when one wants to express that multiple parts of a network share the
 same parameters. Let's see what happens if we construct it and feed data through
 it.

 ```{.python .input  n=6}
 net = FancyMLP()
 net.initialize()
 net(x)
 ```

 There's no reason why we couldn't mix and match these ways of building a
 network. Obviously the example below resembles a [Rube Goldberg
 Machine](https://en.wikipedia.org/wiki/Rube_Goldberg_machine). That said, it
 combines examples for building a block from individual blocks,
 which in turn, may be blocks themselves. Furthermore, we can even combine
 multiple strategies inside the same forward function. To demonstrate this,
 here's the network.

 ```{.python .input  n=7}
 class NestMLP(nn.Block):
     def __init__(self, **kwargs):
         super(NestMLP, self).__init__(**kwargs)
         self.net = nn.Sequential()
         self.net.add(nn.Dense(64, activation='relu'),
                      nn.Dense(32, activation='relu'))
         self.dense = nn.Dense(16, activation='relu')

     def forward(self, x):
         return self.dense(self.net(x))

 chimera = nn.Sequential()
 chimera.add(NestMLP(), nn.Dense(20), FancyMLP())

 chimera.initialize()
 chimera(x)
 ```

 ## Hybridization

 The reader may be starting to think about the efficiency of this Python code.
 After all, we have lots of dictionary lookups, code execution, and lots of
 other Pythonic things going on in what is supposed to be a high performance
 deep learning library. The problems of Python's [Global Interpreter
 Lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well
 known.

 In the device of deep learning, we often have highly performant GPUs that
 depend on CPUs running Python to tell them what to do. This mismatch can
 manifest in the form of GPU starvation when the CPUs can not provide
 instruction fast enough. We can improve this situation by deferring to a more
 performant language instead of Python when possible.

 Gluon does this by allowing for [Hybridization](hybridize.ipynb). In it, the
 Python interpreter executes the block the first time it's invoked. The Gluon
 runtime records what is happening and the next time around it short circuits
 any calls to Python. This can accelerate things considerably in some cases but
 care needs to be taken with [control flow](../../autograd/index.ipynb#Advanced:-Using-Python-control-flow).
	<!--- Licensed to the Apache Software Foundation (ASF) under one -->
	<!--- or more contributor license agreements. See the NOTICE file -->
	<!--- distributed with this work for additional information -->
	<!--- regarding copyright ownership. The ASF licenses this file -->
	<!--- to you under the Apache License, Version 2.0 (the -->
	<!--- "License"); you may not use this file except in compliance -->
	<!--- with the License. You may obtain a copy of the License at -->

	<!--- http://www.apache.org/licenses/LICENSE-2.0 -->

	<!--- Unless required by applicable law or agreed to in writing, -->
	<!--- software distributed under the License is distributed on an -->
	<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
	<!--- KIND, either express or implied. See the License for the -->
	<!--- specific language governing permissions and limitations -->
	<!--- under the License. -->

	# Layers and Blocks

	<!-- adapted from diveintodeeplearning -->

	As network complexity increases, we move from designing single to entire layers
	of neurons.

	Neural network designs like
	[ResNet-152](https://www.cv-foundation.org/openaccess/content_cvpr_2016/papers/He_Deep_Residual_Learning_CVPR_2016_paper.pdf)
	have a fair degree of regularity. They consist of blocks of repeated (or at
	least similarly designed) layers; these blocks then form the basis of more
	complex network designs.

	In this section, we'll talk about how to write code that makes such blocks on
	demand, just like a Lego factory generates blocks which can be combined to
	produce terrific artifacts.

	We start with a very simple block, namely the block for a multilayer
	perceptron. A common strategy would be to construct a two-layer network as
	follows:

	```{.python .input n=1}
	import mxnet as mx
	from mxnet import np, npx
	from mxnet.gluon import nn, Block, Parameter, Constant


	x = np.random.uniform(size=(2, 20))

	net = nn.Sequential()
	net.add(nn.Dense(256, activation='relu'))
	net.add(nn.Dense(10))
	net.initialize()
	net(x)
	```

	This generates a network with a hidden layer of $256$ units, followed by a ReLU
	activation and another $10$ units governing the output. In particular, we used
	the [nn.Sequential](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Sequential)
	constructor to generate an empty network into which we then inserted both
	layers. What exactly happens inside `nn.Sequential`
	has remained rather mysterious so far. In the following we will see that this
	really just constructs a block that is a container for other blocks. These
	blocks can be combined into larger artifacts, often recursively. The diagram
	below shows how:

	![Blocks can be used recursively to form larger artifacts](/_static/blocks.svg)

	In the following we will explain the various steps needed to go from defining
	layers to defining blocks (of one or more layers):

	1. Blocks take data as input.
	1. Blocks store state in the form of parameters that are inherent to the block.
	For instance, the block above contains two hidden layers, and we need a
	place to store parameters for it.
	1. Blocks produce meaningful output. This is typically encoded in what
	we will call the `forward` function. It allows us to invoke a block via
	`net(X)` to obtain the desired output. What happens behind the scenes is
	that it invokes `forward` to perform forward propagation (also called
	forward computation).
	1. Blocks initialize the parameters in a lazy fashion as part of the first
	`forward` call.
	1. Blocks calculate a gradient with regard to their input when invoking
	`backward`. Typically this is automatic.

	## A Sequential Block

	The [Block](../../../../api/gluon/block.rst#mxnet.gluon.Block) class is a
	generic component describing data flow. When the data flows through a sequence
	of blocks, each block applied to the output of the one before with the first
	block being applied on the input data itself, we have a special kind of block,
	namely the `Sequential` block.

	`Sequential` has helper methods to manage the sequence, with `add` being the
	main one of interest allowing you to append blocks in sequence. Once the
	operations have been added, the forward computation of the model applies the
	blocks on the input data in the order they were added. Below, we implement a
	`MySequential` class that has the same functionality as the `Sequential` class.
	This may help you understand more clearly how the `Sequential` class works.

	```{.python .input n=3}
	class MySequential(Block):
	def __init__(self):
	super(MySequential, self).__init__()
	self._layers = []

	def add(self, block):
	# Here, block is an instance of a Block subclass, and we assume it has a unique name. We save it in the
	# member variable _layers of the Block class, and its type is List. When the MySequential instance
	# calls the initialize function, the system automatically initializes all members of _layers.
	self._layers.append(block)
	self.register_child(block)

	def forward(self, x):
	# OrderedDict guarantees that members will be traversed in the order they were added.
	for block in self._children.values():
	x = block()(x)
	return x
	```

	At its core is the `add` method. It adds any block to the ordered dictionary of
	children. These are then executed in sequence when forward propagation is
	invoked. Let's see what the MLP looks like now.

	```{.python .input n=4}
	net = MySequential()
	net.add(nn.Dense(256, activation='relu'))
	net.add(nn.Dense(10))
	net.initialize()
	net(x)
	```

	Indeed, it is no different than It can observed here that the use of the
	`MySequential` class is no different from the use of the Sequential class.


	## A Custom Block

	It is easy to go beyond simple concatenation with `Sequential`. The
	`Block` class provides the functionality required to make such customizations.
	`Block` has a model constructor provided in the `nn` module, which we can
	inherit to define the model we want. The following inherits the `Block` class to
	construct the multilayer perceptron mentioned at the beginning of this section.
	The `MLP` class defined here overrides the `__init__` and `forward` functions
	of the Block class. They are used to create model parameters and define forward
	computations, respectively. Forward computation is also forward propagation.

	```{.python .input n=1}
	class MLP(nn.Block):
	# Declare a layer with model parameters. Here, we declare two fully
	# connected layers.

	def __init__(self, **kwargs):
	# Call the constructor of the MLP parent class Block to perform the
	# necessary initialization. In this way, other function parameters can
	# also be specified when constructing an instance, such as the model
	# parameter, params, described in the following sections.
	super(MLP, self).__init__(**kwargs)
	self.hidden = nn.Dense(256, activation='relu') # Hidden layer
	self.output = nn.Dense(10) # Output layer

	# Define the forward computation of the model, that is, how to return the
	# required model output based on the input x.

	def forward(self, x):
	hidden_out = self.hidden(x)
	return self.output(hidden_out)
	```

	Let's look at it a bit more closely. The `forward` method invokes a network
	simply by evaluating the hidden layer `self.hidden(x)` and subsequently by
	evaluating the output layer `self.output( ... )`. This is what we expect in the
	forward pass of this block.

	In order for the block to know what it needs to evaluate, we first need to
	define the layers. This is what the `__init__` method does. It first
	initializes all of the Block-related parameters and then constructs the
	requisite layers. This attaches the coresponding layers and the required
	parameters to the class. Note that there is no need to define a backpropagation
	method in the class. The system automatically generates the `backward` method
	needed for back propagation by automatically finding the gradient (see the tutorial
	on [autograd](../../autograd/index.ipynb)). The same applies to the
	[initialize](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Block.initialize)
	method, which is generated automatically. Let's try
	this out:

	```{.python .input n=2}
	net = MLP()
	net.initialize()
	net(x)
	```

	As explained above, the `Block` class can be quite versatile in terms of what it
	does. For instance, its subclass can be a layer (such as the `Dense` class
	provided by Gluon), it can be a model (such as the `MLP` class we just derived),
	or it can be a part of a model (this is what typically happens when designing
	very deep networks). Throughout this chapter we will see how to use this with
	great flexibility.


	## Coding with `Blocks`

	### Blocks
	The [Sequential](../../../../api/gluon/nn/index.rst#mxnet.gluon.nn.Sequential) class
	can make model construction easier and does not require you to define the
	`forward` method; however, directly inheriting from
	its parent class, [Block](../../../../api/gluon/block.rst#mxnet.gluon.Block), can greatly
	expand the flexibility of model construction. For example, implementing the
	`forward` method means you can introduce control flow in the network.

	### Constant parameters
	Now we'd like to introduce the notation of a constant parameter. These are
	parameters that are not used when invoking backpropagation. This sounds very
	abstract but here's what's really going on.
	Assume that we have some function

	$$f(\mathbf{x},\mathbf{w}) = 3 \cdot \mathbf{w}^\top \mathbf{x}.$$

	In this case $3$ is a constant parameter. We could change $3$ to something else,
	say $c$ via

	$$f(\mathbf{x},\mathbf{w}) = c \cdot \mathbf{w}^\top \mathbf{x}.$$

	Nothing has really changed, except that we can adjust the value of $c$. It is
	still a constant as far as $\mathbf{w}$ and $\mathbf{x}$ are concerned. However,
	Gluon doesn't know about this unless we create it with `get_constant`
	(this makes the code go faster, too, since we're not sending the Gluon engine
	on a wild goose chase after a parameter that doesn't change).

	```{.python .input n=5}
	class FancyMLP(nn.Block):
	def __init__(self, **kwargs):
	super(FancyMLP, self).__init__(**kwargs)

	# Random weight parameters created with the get_constant are not
	# iterated during training (i.e. constant parameters).
	self.rand_weight = Constant(np.random.uniform(size=(20, 20)))
	self.dense = nn.Dense(20, activation='relu')

	def forward(self, x):
	x = self.dense(x)
	# Use the constant parameters created, as well as the ReLU and dot
	# functions of NDArray.

	x = npx.relu(np.dot(x, self.rand_weight.data()) + 1)
	# Re-use the fully connected layer. This is equivalent to sharing
	# parameters with two fully connected layers.
	x = self.dense(x)
	# Here in the control flow, we need to call `item` to return the
	# scalar for comparison.

	while npx.norm(x).item() > 1:
	x /= 2
	if npx.norm(x).item() < 0.8:
	x *= 10
	return x.sum()
	```

	In this `FancyMLP` model, we used constant weight `rand_weight` (note that it is
	not a model parameter), performed a matrix multiplication operation (`nd.dot`),
	and reused the same `Dense` layer. Note that this is very different from using
	two dense layers with different sets of parameters. Instead, we used the same
	network twice. Quite often in deep networks one also says that the parameters
	are tied when one wants to express that multiple parts of a network share the
	same parameters. Let's see what happens if we construct it and feed data through
	it.

	```{.python .input n=6}
	net = FancyMLP()
	net.initialize()
	net(x)
	```

	There's no reason why we couldn't mix and match these ways of building a
	network. Obviously the example below resembles a [Rube Goldberg
	Machine](https://en.wikipedia.org/wiki/Rube_Goldberg_machine). That said, it
	combines examples for building a block from individual blocks,
	which in turn, may be blocks themselves. Furthermore, we can even combine
	multiple strategies inside the same forward function. To demonstrate this,
	here's the network.

	```{.python .input n=7}
	class NestMLP(nn.Block):
	def __init__(self, **kwargs):
	super(NestMLP, self).__init__(**kwargs)
	self.net = nn.Sequential()
	self.net.add(nn.Dense(64, activation='relu'),
	nn.Dense(32, activation='relu'))
	self.dense = nn.Dense(16, activation='relu')

	def forward(self, x):
	return self.dense(self.net(x))

	chimera = nn.Sequential()
	chimera.add(NestMLP(), nn.Dense(20), FancyMLP())

	chimera.initialize()
	chimera(x)
	```

	## Hybridization

	The reader may be starting to think about the efficiency of this Python code.
	After all, we have lots of dictionary lookups, code execution, and lots of
	other Pythonic things going on in what is supposed to be a high performance
	deep learning library. The problems of Python's [Global Interpreter
	Lock](https://wiki.python.org/moin/GlobalInterpreterLock) are well
	known.

	In the device of deep learning, we often have highly performant GPUs that
	depend on CPUs running Python to tell them what to do. This mismatch can
	manifest in the form of GPU starvation when the CPUs can not provide
	instruction fast enough. We can improve this situation by deferring to a more
	performant language instead of Python when possible.

	Gluon does this by allowing for [Hybridization](hybridize.ipynb). In it, the
	Python interpreter executes the block the first time it's invoked. The Gluon
	runtime records what is happening and the next time around it short circuits
	any calls to Python. This can accelerate things considerably in some cases but
	care needs to be taken with [control flow](../../autograd/index.ipynb#Advanced:-Using-Python-control-flow).