{
"cells": [
{
"cell_type": "markdown",
"id": "73900e14",
"metadata": {},
"source": [
"<!--- Licensed to the Apache Software Foundation (ASF) under one -->\n",
"<!--- or more contributor license agreements. See the NOTICE file -->\n",
"<!--- distributed with this work for additional information -->\n",
"<!--- regarding copyright ownership. The ASF licenses this file -->\n",
"<!--- to you under the Apache License, Version 2.0 (the -->\n",
"<!--- \"License\"); you may not use this file except in compliance -->\n",
"<!--- with the License. You may obtain a copy of the License at -->\n",
"\n",
"<!--- http://www.apache.org/licenses/LICENSE-2.0 -->\n",
"\n",
"<!--- Unless required by applicable law or agreed to in writing, -->\n",
"<!--- software distributed under the License is distributed on an -->\n",
"<!--- \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->\n",
"<!--- KIND, either express or implied. See the License for the -->\n",
"<!--- specific language governing permissions and limitations -->\n",
"<!--- under the License. -->\n",
"\n",
"# Initialization\n",
"\n",
"<!-- adapted from diveintodeeplearning -->\n",
"\n",
"In the [Neural Networks](nn.html) section we played fast and loose with setting\n",
"up our networks. In particular we did the following things that *shouldn't*\n",
"work:\n",
"\n",
"* We defined the network architecture with no regard to the input\n",
" dimensionality.\n",
"* We added layers without regard to the output dimension of the previous layer.\n",
"* We even 'initialized' these parameters without knowing how many parameters\n",
" we were going to initialize.\n",
"\n",
"All of those things sound impossible and indeed, they are. After all, there's\n",
"no way MXNet (or any other framework for that matter) could predict what the\n",
"input dimensionality of a network would be. Later on, when working with\n",
"convolutional networks and images this problem will become even more pertinent,\n",
"since the input dimensionality (i.e. the resolution of an image) will affect\n",
"the dimensionality of subsequent layers. The ability to\n",
"determine parameter dimensionality during run-time rather than at coding time\n",
"greatly simplifies the process of doing deep learning.\n",
"\n",
"## Instantiating a Network\n",
"\n",
"Let's see what happens when we instantiate a network. We start by defining a multi-layer perceptron."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7aba4d67",
"metadata": {},
"outputs": [],
"source": [
"from mxnet import init, nd\n",
"from mxnet.gluon import nn\n",
"\n",
"\n",
"def getnet():\n",
" net = nn.Sequential()\n",
" net.add(nn.Dense(256, activation='relu'))\n",
" net.add(nn.Dense(10))\n",
" return net\n",
"\n",
"net = getnet()"
]
},
{
"cell_type": "markdown",
"id": "bf783219",
"metadata": {},
"source": [
"At this point the network doesn't really know yet what the dimensionalities of\n",
"the various parameters should be. All one could tell at this point is that each\n",
"layer needs weights and bias, albeit of unspecified dimensionality. If we try\n",
"accessing the parameters, that's exactly what happens."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2da27992",
"metadata": {},
"outputs": [],
"source": [
"print(net.collect_params())"
]
},
{
"cell_type": "markdown",
"id": "3c21a00b",
"metadata": {},
"source": [
"You'll notice `None` here in each `Dense` layer. This absence of value is how\n",
"MXNet keeps track of unspecified dimensionality. In particular, trying to access\n",
"`net[0].weight.data()` at this point would trigger a runtime error stating that\n",
"the network needs initializing before it can do anything.\n",
"\n",
"Note that if we did want to specify dimensionality, we could have done so by\n",
"using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`.\n",
"\n",
"Let's see whether anything changes after we initialize the parameters:"
]
},
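{
"cell_type": "code",
"execution_count": null,
"id": "3b1c9e2f",
"metadata": {},
"outputs": [],
"source": [
"# The parameters exist, but they have no shape or values yet, so asking for\n",
"# their data fails. (Illustrative sketch: we catch a broad Exception because\n",
"# the exact exception class may differ between MXNet versions.)\n",
"try:\n",
"    net[0].weight.data()\n",
"except Exception as e:\n",
"    print('Accessing the data failed:', e)"
]
},
{
"cell_type": "markdown",
"id": "5d4e3f2a",
"metadata": {},
"source": [
"Let's see whether anything changes after we initialize the parameters:"
]
},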
{
"cell_type": "code",
"execution_count": null,
"id": "7a935285",
"metadata": {},
"outputs": [],
"source": [
"net.initialize()\n",
"net.collect_params()"
]
},
{
"cell_type": "markdown",
"id": "ed9e2dd9",
"metadata": {},
"source": [
"As we can see, nothing really changed. Only once we provide the network with\n",
"some data do we see a difference. Let's try it out."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "75d398f7",
"metadata": {},
"outputs": [],
"source": [
"x = nd.random.uniform(shape=(2, 20))\n",
"net(x) # Forward computation\n",
"print(net.collect_params())"
]
},
{
"cell_type": "markdown",
"id": "f2c8c693",
"metadata": {},
"source": [
"We see all the dimensions have been determined and the parameters initialized.\n",
"This is because shape inference and parameter initialization have been\n",
"performed in a lazy manner, so they are performed only when needed. In the\n",
"above case, they are performed as a prerequisite to the forward computation.\n",
"\n",
"Dimensional inference works like this: as soon as we knew the input\n",
"dimensionality, $\\mathbf{x} \\in \\mathbb{R}^{20}$ it was possible to define the\n",
"weight matrix for the first layer, i.e. $\\mathbf{W}_1 \\in \\mathbb{R}^{256 \\times\n",
"20}$. With that out of the way, we can progress to the second layer, define its\n",
"dimensionality to be $10 \\times 256$ and so on through the computational graph\n",
"and resolve all the dimensions as they become available. Once this is known, we\n",
"can proceed by initializing parameters. This is the solution to the three\n",
"problems outlined above.\n",
"\n",
"\n",
"## Deferred Initialization in Practice\n",
"\n",
"Now that we know how it works in theory, let's see when the initialization is\n",
"actually triggered. In order to do so, we mock up an initializer which does\n",
"nothing but report a debug message stating when it was invoked and with which\n",
"parameters."
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "2c9ae1a2",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "22"
}
},
"outputs": [],
"source": [
"class PrintInit(init.Initializer):\n",
" def _init_weight(self, name, data):\n",
" print('Init', name, data.shape)\n",
" # The actual initialization logic is omitted here.\n",
"\n",
"net = getnet()\n",
"net.initialize(init=PrintInit())"
]
},
{
"cell_type": "markdown",
"id": "c4259177",
"metadata": {},
"source": [
"Note that, although `MyInit` will print information about the model parameters\n",
"when it is called, the above `initialize` function does not print any\n",
"information after it has been executed. Therefore there is no actual\n",
"initialization when calling the `initialize` function - this\n",
"+initialization is deferred until forward is called for the first time. Next,\n",
"we define the input and perform a forward calculation."
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "6a11a34c",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "25"
}
},
"outputs": [],
"source": [
"x = nd.random.uniform(shape=(2, 20))\n",
"y = net(x)"
]
},
{
"cell_type": "markdown",
"id": "1bdafc56",
"metadata": {},
"source": [
"At this time, information on the model parameters is printed. When performing a\n",
"forward calculation based on the input `x`, the system can automatically infer\n",
"the shape of the weight parameters of all layers based on the shape of the\n",
"input. Once the system has created these parameters, it calls the `MyInit`\n",
"instance to initialize them before proceeding to the forward calculation.\n",
"\n",
"Of course, this initialization will only be called when completing the initial\n",
"forward calculation. After that, we will not re-initialize when we run the\n",
"forward calculation `net(x)`, so the output of the `MyInit` instance will not be\n",
"generated again."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3f119c9",
"metadata": {},
"outputs": [],
"source": [
"y = net(x)"
]
},
{
"cell_type": "markdown",
"id": "3fcada41",
"metadata": {},
"source": [
"As mentioned at the beginning of this section, deferred initialization can also\n",
"cause confusion. Before the first forward calculation, we were unable to\n",
"directly manipulate the model parameters, for example, we could not use the\n",
"`data` and `set_data` functions to get and modify the parameters. Therefore, we\n",
"often force initialization by sending a sample observation through the network.\n",
"\n",
"## Forced Initialization\n",
"\n",
"Deferred initialization does not occur if the system knows the shape of all\n",
"parameters when calling the `initialize` function. This can occur in two cases:\n",
"\n",
"* We've already seen some data and we just want to reset the parameters.\n",
"* We specified all input and output dimensions of the network or layer when\n",
" defining it.\n",
"\n",
"The first case works just fine, as illustrated below."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "18fc53f5",
"metadata": {},
"outputs": [],
"source": [
"net.initialize(init=MyInit(), force_reinit=True)"
]
},
{
"cell_type": "markdown",
"id": "831f6ae1",
"metadata": {},
"source": [
"The second case requires us to specify the remaining set of parameters when\n",
"creating the layer. For instance, for dense layers we also need to specify the\n",
"`in_units` so that initialization can occur immediately once `initialize` is\n",
"called."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5663885a",
"metadata": {},
"outputs": [],
"source": [
"net = nn.Sequential()\n",
"net.add(nn.Dense(256, in_units=20, activation='relu'))\n",
"net.add(nn.Dense(10, in_units=256))\n",
"\n",
"net.initialize(init=MyInit())"
]
},
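{
"cell_type": "markdown",
"id": "8e7d6c5b",
"metadata": {},
"source": [
"This time the `Init` messages appear as soon as `initialize` is called, and the\n",
"parameter arrays can be inspected right away, without a forward pass. The small\n",
"check below is illustrative only; note that our `MyInit` merely prints, so no\n",
"special values have been written into the arrays."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9f8e7d6c",
"metadata": {},
"outputs": [],
"source": [
"# The shapes are fixed and the data arrays already exist before any forward pass.\n",
"print(net[0].weight.data().shape)\n",
"print(net[1].weight.data().shape)"
]
},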
{
"cell_type": "markdown",
"id": "faad3e74",
"metadata": {},
"source": [
"## Parameter Initialization\n",
"\n",
"By default, MXNet initializes the weight matrices uniformly by drawing random\n",
"values with uniform-distribution between $-0.07$ and $0.07$ ($U[-0.07, 0.07]$)\n",
"and updates the bias parameters by setting them all to $0$. However, we often\n",
"need to use other methods to initialize the weights. MXNet's `init` module\n",
"provides a variety of preset initialization methods, but if we want something\n",
"out of the ordinary, we need a bit of extra work.\n",
"\n",
"### Built-in Initialization\n",
"\n",
"Let's begin with the built-in initializers. The code below initializes all\n",
"parameters with Gaussian random variables."
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "afbbd5f8",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "9"
}
},
"outputs": [],
"source": [
"# force_reinit ensures that the variables are initialized again, regardless of\n",
"# whether they were already initialized previously.\n",
"net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)\n",
"print(net[0].weight.data()[0])"
]
},
{
"cell_type": "markdown",
"id": "7f7cebc5",
"metadata": {},
"source": [
"If we wanted to initialize all parameters to $1$, we could do this simply by\n",
"changing the initializer to `Constant(1)`."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "a9b0c842",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "10"
}
},
"outputs": [],
"source": [
"net.initialize(init=init.Constant(1), force_reinit=True)\n",
"net[0].weight.data()[0]"
]
},
{
"cell_type": "markdown",
"id": "aebf603a",
"metadata": {},
"source": [
"If we want to initialize only a specific parameter in a different manner, we\n",
"can simply set the initializer only for the appropriate subblock (or\n",
"parameter). For instance, below we initialize the second layer to a constant\n",
"value of $42$ and we use the `Xavier` initializer for the weights of the\n",
"first layer."
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "2209624b",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "11"
}
},
"outputs": [],
"source": [
"net[0].weight.initialize(init=init.Xavier(), force_reinit=True)\n",
"net[1].initialize(init=init.Constant(42), force_reinit=True)\n",
"\n",
"# First layer\n",
"print(net[0].weight.data()[0])\n",
"print(net[0].bias.data()[0]) # initialized to 0\n",
"\n",
"# Second layer\n",
"print(net[1].weight.data()[0,0])\n",
"print(net[1].bias.data()[0]) # initialized to 0"
]
},
{
"cell_type": "markdown",
"id": "82d9d8a5",
"metadata": {},
"source": [
"### Custom Initialization\n",
"\n",
"Sometimes, the initialization methods we need are not provided in the `init`\n",
"module. At this point, we can implement a subclass of the `Initializer` class\n",
"so that we can use it like any other initialization method. Usually, we only\n",
"need to implement the `_init_weight` function to suit our needs. In the example\n",
"below, we pick a decidedly bizarre and nontrivial distribution, just to prove\n",
"the point. We draw the coefficients from the following distribution:\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
" w \\sim \\begin{cases}\n",
" U[5, 10] & \\text{ with probability } \\frac{1}{4} \\\\\n",
" 0 & \\text{ with probability } \\frac{1}{2} \\\\\n",
" U[-10, -5] & \\text{ with probability } \\frac{1}{4}\n",
" \\end{cases}\n",
"\\end{aligned}\n",
"$$"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "7609305d",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "12"
}
},
"outputs": [],
"source": [
"class MyInit(init.Initializer):\n",
" def _init_weight(self, name, data):\n",
" print('Init', name, data.shape)\n",
" data[:] = nd.random.uniform(low=-10, high=10, shape=data.shape)\n",
" data *= data.abs() >= 5\n",
"\n",
"net.initialize(MyInit(), force_reinit=True)\n",
"net[0].weight.data()[0]"
]
},
{
"cell_type": "markdown",
"id": "720c7d8b",
"metadata": {},
"source": [
"If this functionality is insufficient, we can even set parameters directly.\n",
"Since `data()` returns an `NDArray` we can access it just like any other matrix.\n",
"A note for advanced users - if you want to adjust parameters within an\n",
"`autograd` scope you need to use `set_data` to avoid confusing the automatic\n",
"differentiation mechanics."
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "ca608aea",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "13"
}
},
"outputs": [],
"source": [
"net[0].weight.data()[:] += 1\n",
"net[0].weight.data()[0,0] = 42\n",
"net[0].weight.data()[0]"
]
},
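{
"cell_type": "markdown",
"id": "a1f2e3d4",
"metadata": {},
"source": [
"For completeness, here is a small sketch of the `set_data` route mentioned\n",
"above: build a modified copy of the weight array and write it back in one step.\n",
"The particular values are arbitrary."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2e3f4a5",
"metadata": {},
"outputs": [],
"source": [
"# Overwrite the whole weight array in one step via set_data (sketch only).\n",
"new_weight = net[0].weight.data().copy()\n",
"new_weight[0, 0] = 0\n",
"net[0].weight.set_data(new_weight)\n",
"net[0].weight.data()[0]"
]
},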
{
"cell_type": "markdown",
"id": "f881c7ea",
"metadata": {},
"source": [
"## Tied Parameters\n",
"\n",
"In some cases, we want to share model parameters across multiple layers. For\n",
"instance when we want to find good word embeddings we may decide to use the\n",
"same parameters both for encoding and decoding of words. Let's see how to do\n",
"this a bit more elegantly. In the following we construct a dense layer and then\n",
"use its parameters specifically to set those of another layer."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "29540089",
"metadata": {
"attributes": {
"classes": [],
"id": "",
"n": "14"
}
},
"outputs": [],
"source": [
"net = nn.Sequential()\n",
"# We need to give the shared layer a name such that we can reference its\n",
"# parameters.\n",
"shared = nn.Dense(8, activation='relu')\n",
"net.add(nn.Dense(8, activation='relu'),\n",
" shared,\n",
" nn.Dense(8, activation='relu', params=shared.params),\n",
" nn.Dense(10))\n",
"net.initialize()\n",
"\n",
"x = nd.random.uniform(shape=(2, 20))\n",
"net(x)\n",
"\n",
"# Check whether the parameters are the same.\n",
"print(net[1].weight.data()[0] == net[2].weight.data()[0])\n",
"net[1].weight.data()[0,0] = 100\n",
"# And make sure that they're actually the same object rather than just having\n",
"# the same value.\n",
"print(net[1].weight.data()[0] == net[2].weight.data()[0])"
]
},
{
"cell_type": "markdown",
"id": "0e4153ff",
"metadata": {},
"source": [
"The above example shows that the parameters of the second and third layer are\n",
"tied. As Python objects, they are identical rather than just being equal.\n",
"That is, by changing one of the parameters the other one changes too. What\n",
"happens to the gradients is quite ingenious. Since the model parameters contain\n",
"gradients, the gradients of the second hidden layer and the third hidden layer\n",
"are accumulated in `shared.params.grad` during backpropagation.\n",
"\n",
"## Conclusion\n",
"\n",
"In this tutorial you learnt how to initialize a neural network, and should now\n",
"understand the difference between deferred and forced initialization. Some more advanced\n",
"cases you should now be aware of include custom initialization and tied parameters.\n",
"\n",
"## Recommended Next Steps\n",
"\n",
"* Check out the [API Docs](/api/python/docs/api/optimizer/index.html) on initialization for a list of available initialization methods.\n",
"* See [this tutorial](/api/python/docs/tutorials/packages/gluon/blocks/naming.html) for more information on Gluon Parameters."
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 5
}