| { |
| "cells": [ |
| { |
| "cell_type": "markdown", |
| "id": "67f1b310", |
| "metadata": {}, |
| "source": [ |
| "<!--- Licensed to the Apache Software Foundation (ASF) under one -->\n", |
| "<!--- or more contributor license agreements. See the NOTICE file -->\n", |
| "<!--- distributed with this work for additional information -->\n", |
| "<!--- regarding copyright ownership. The ASF licenses this file -->\n", |
| "<!--- to you under the Apache License, Version 2.0 (the -->\n", |
| "<!--- \"License\"); you may not use this file except in compliance -->\n", |
| "<!--- with the License. You may obtain a copy of the License at -->\n", |
| "\n", |
| "<!--- http://www.apache.org/licenses/LICENSE-2.0 -->\n", |
| "\n", |
| "<!--- Unless required by applicable law or agreed to in writing, -->\n", |
| "<!--- software distributed under the License is distributed on an -->\n", |
| "<!--- \"AS IS\" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->\n", |
| "<!--- KIND, either express or implied. See the License for the -->\n", |
| "<!--- specific language governing permissions and limitations -->\n", |
| "<!--- under the License. -->\n", |
| "\n", |
| "# Initialization\n", |
| "\n", |
| "<!-- adapted from diveintodeeplearning -->\n", |
| "\n", |
| "In the [Neural Networks](nn.html) section we played fast and loose with setting\n", |
| "up our networks. In particular we did the following things that *shouldn't*\n", |
| "work:\n", |
| "\n", |
| "* We defined the network architecture with no regard to the input\n", |
| " dimensionality.\n", |
| "* We added layers without regard to the output dimension of the previous layer.\n", |
| "* We even 'initialized' these parameters without knowing how many parameters\n", |
| " we were going to initialize.\n", |
| "\n", |
| "All of those things sound impossible and indeed, they are. After all, there's\n", |
| "no way MXNet (or any other framework for that matter) could predict what the\n", |
| "input dimensionality of a network would be. Later on, when working with\n", |
| "convolutional networks and images this problem will become even more pertinent,\n", |
| "since the input dimensionality (i.e. the resolution of an image) will affect\n", |
| "the dimensionality of subsequent layers. The ability to\n", |
| "determine parameter dimensionality during run-time rather than at coding time\n", |
| "greatly simplifies the process of doing deep learning.\n", |
| "\n", |
| "## Instantiating a Network\n", |
| "\n", |
| "Let's see what happens when we instantiate a network. We start by defining a multi-layer perceptron." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "815fc952", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "from mxnet import init, nd\n", |
| "from mxnet.gluon import nn\n", |
| "\n", |
| "\n", |
| "def getnet():\n", |
| " net = nn.Sequential()\n", |
| " net.add(nn.Dense(256, activation='relu'))\n", |
| " net.add(nn.Dense(10))\n", |
| " return net\n", |
| "\n", |
| "net = getnet()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "563c33b5", |
| "metadata": {}, |
| "source": [ |
| "At this point the network doesn't really know yet what the dimensionalities of\n", |
| "the various parameters should be. All one could tell at this point is that each\n", |
| "layer needs weights and bias, albeit of unspecified dimensionality. If we try\n", |
| "accessing the parameters, that's exactly what happens." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "0d93cf2e", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "print(net.collect_params())" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "ca7f22f0", |
| "metadata": {}, |
| "source": [ |
| "You'll notice `None` here in each `Dense` layer. This absence of value is how\n", |
| "MXNet keeps track of unspecified dimensionality. In particular, trying to access\n", |
| "`net[0].weight.data()` at this point would trigger a runtime error stating that\n", |
| "the network needs initializing before it can do anything.\n", |
| "\n", |
| "Note that if we did want to specify dimensionality, we could have done so by\n", |
| "using the kwarg `in_units`, e.g. `Dense(256, activiation='relu', in_units=20)`.\n", |
| "\n", |
| "Let's see whether anything changes after we initialize the parameters:" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "b22eff22", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "net.initialize()\n", |
| "net.collect_params()" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "24776588", |
| "metadata": {}, |
| "source": [ |
| "As we can see, nothing really changed. Only once we provide the network with\n", |
| "some data do we see a difference. Let's try it out." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "3a2dd878", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "x = nd.random.uniform(shape=(2, 20))\n", |
| "net(x) # Forward computation\n", |
| "print(net.collect_params())" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "ea6c77b3", |
| "metadata": {}, |
| "source": [ |
| "We see all the dimensions have been determined and the parameters initialized.\n", |
| "This is because shape inference and parameter initialization have been\n", |
| "performed in a lazy manner, so they are performed only when needed. In the\n", |
| "above case, they are performed as a prerequisite to the forward computation.\n", |
| "\n", |
| "Dimensional inference works like this: as soon as we knew the input\n", |
| "dimensionality, $\\mathbf{x} \\in \\mathbb{R}^{20}$ it was possible to define the\n", |
| "weight matrix for the first layer, i.e. $\\mathbf{W}_1 \\in \\mathbb{R}^{256 \\times\n", |
| "20}$. With that out of the way, we can progress to the second layer, define its\n", |
| "dimensionality to be $10 \\times 256$ and so on through the computational graph\n", |
| "and resolve all the dimensions as they become available. Once this is known, we\n", |
| "can proceed by initializing parameters. This is the solution to the three\n", |
| "problems outlined above.\n", |
| "\n", |
| "\n", |
| "## Deferred Initialization in Practice\n", |
| "\n", |
| "Now that we know how it works in theory, let's see when the initialization is\n", |
| "actually triggered. In order to do so, we mock up an initializer which does\n", |
| "nothing but report a debug message stating when it was invoked and with which\n", |
| "parameters." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 22, |
| "id": "84f7e5d9", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "22" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "class PrintInit(init.Initializer):\n", |
| " def _init_weight(self, name, data):\n", |
| " print('Init', name, data.shape)\n", |
| " # The actual initialization logic is omitted here.\n", |
| "\n", |
| "net = getnet()\n", |
| "net.initialize(init=PrintInit())" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "168b2eb9", |
| "metadata": {}, |
| "source": [ |
| "Note that, although `MyInit` will print information about the model parameters\n", |
| "when it is called, the above `initialize` function does not print any\n", |
| "information after it has been executed. Therefore there is no actual\n", |
| "initialization when calling the `initialize` function - this\n", |
| "+initialization is deferred until forward is called for the first time. Next,\n", |
| "we define the input and perform a forward calculation." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 25, |
| "id": "8b423247", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "25" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "x = nd.random.uniform(shape=(2, 20))\n", |
| "y = net(x)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "68f7d501", |
| "metadata": {}, |
| "source": [ |
| "At this time, information on the model parameters is printed. When performing a\n", |
| "forward calculation based on the input `x`, the system can automatically infer\n", |
| "the shape of the weight parameters of all layers based on the shape of the\n", |
| "input. Once the system has created these parameters, it calls the `MyInit`\n", |
| "instance to initialize them before proceeding to the forward calculation.\n", |
| "\n", |
| "Of course, this initialization will only be called when completing the initial\n", |
| "forward calculation. After that, we will not re-initialize when we run the\n", |
| "forward calculation `net(x)`, so the output of the `MyInit` instance will not be\n", |
| "generated again." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "47cf41a0", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "y = net(x)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "2c9d5675", |
| "metadata": {}, |
| "source": [ |
| "As mentioned at the beginning of this section, deferred initialization can also\n", |
| "cause confusion. Before the first forward calculation, we were unable to\n", |
| "directly manipulate the model parameters, for example, we could not use the\n", |
| "`data` and `set_data` functions to get and modify the parameters. Therefore, we\n", |
| "often force initialization by sending a sample observation through the network.\n", |
| "\n", |
| "## Forced Initialization\n", |
| "\n", |
| "Deferred initialization does not occur if the system knows the shape of all\n", |
| "parameters when calling the `initialize` function. This can occur in two cases:\n", |
| "\n", |
| "* We've already seen some data and we just want to reset the parameters.\n", |
| "* We specified all input and output dimensions of the network or layer when\n", |
| " defining it.\n", |
| "\n", |
| "The first case works just fine, as illustrated below." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "938eeef7", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "net.initialize(init=MyInit(), force_reinit=True)" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "74c209ae", |
| "metadata": {}, |
| "source": [ |
| "The second case requires us to specify the remaining set of parameters when\n", |
| "creating the layer. For instance, for dense layers we also need to specify the\n", |
| "`in_units` so that initialization can occur immediately once `initialize` is\n", |
| "called." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": null, |
| "id": "d6cc897c", |
| "metadata": {}, |
| "outputs": [], |
| "source": [ |
| "net = nn.Sequential()\n", |
| "net.add(nn.Dense(256, in_units=20, activation='relu'))\n", |
| "net.add(nn.Dense(10, in_units=256))\n", |
| "\n", |
| "net.initialize(init=MyInit())" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "84fe964c", |
| "metadata": {}, |
| "source": [ |
| "## Parameter Initialization\n", |
| "\n", |
| "By default, MXNet initializes the weight matrices uniformly by drawing random\n", |
| "values with uniform-distribution between $-0.07$ and $0.07$ ($U[-0.07, 0.07]$)\n", |
| "and updates the bias parameters by setting them all to $0$. However, we often\n", |
| "need to use other methods to initialize the weights. MXNet's `init` module\n", |
| "provides a variety of preset initialization methods, but if we want something\n", |
| "out of the ordinary, we need a bit of extra work.\n", |
| "\n", |
| "### Built-in Initialization\n", |
| "\n", |
| "Let's begin with the built-in initializers. The code below initializes all\n", |
| "parameters with Gaussian random variables." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 9, |
| "id": "49fe2901", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "9" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "# force_reinit ensures that the variables are initialized again, regardless of\n", |
| "# whether they were already initialized previously.\n", |
| "net.initialize(init=init.Normal(sigma=0.01), force_reinit=True)\n", |
| "print(net[0].weight.data()[0])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "a9bce50a", |
| "metadata": {}, |
| "source": [ |
| "If we wanted to initialize all parameters to $1$, we could do this simply by\n", |
| "changing the initializer to `Constant(1)`." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 10, |
| "id": "6f37d975", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "10" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "net.initialize(init=init.Constant(1), force_reinit=True)\n", |
| "net[0].weight.data()[0]" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "d098a652", |
| "metadata": {}, |
| "source": [ |
| "If we want to initialize only a specific parameter in a different manner, we\n", |
| "can simply set the initializer only for the appropriate subblock (or\n", |
| "parameter). For instance, below we initialize the second layer to a constant\n", |
| "value of $42$ and we use the `Xavier` initializer for the weights of the\n", |
| "first layer." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 11, |
| "id": "39c66793", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "11" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "net[0].weight.initialize(init=init.Xavier(), force_reinit=True)\n", |
| "net[1].initialize(init=init.Constant(42), force_reinit=True)\n", |
| "\n", |
| "# First layer\n", |
| "print(net[0].weight.data()[0])\n", |
| "print(net[0].bias.data()[0]) # initialized to 0\n", |
| "\n", |
| "# Second layer\n", |
| "print(net[1].weight.data()[0,0])\n", |
| "print(net[1].bias.data()[0]) # initialized to 0" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "5996ff69", |
| "metadata": {}, |
| "source": [ |
| "### Custom Initialization\n", |
| "\n", |
| "Sometimes, the initialization methods we need are not provided in the `init`\n", |
| "module. At this point, we can implement a subclass of the `Initializer` class\n", |
| "so that we can use it like any other initialization method. Usually, we only\n", |
| "need to implement the `_init_weight` function to suit our needs. In the example\n", |
| "below, we pick a decidedly bizarre and nontrivial distribution, just to prove\n", |
| "the point. We draw the coefficients from the following distribution:\n", |
| "\n", |
| "$$\n", |
| "\\begin{aligned}\n", |
| " w \\sim \\begin{cases}\n", |
| " U[5, 10] & \\text{ with probability } \\frac{1}{4} \\\\\n", |
| " 0 & \\text{ with probability } \\frac{1}{2} \\\\\n", |
| " U[-10, -5] & \\text{ with probability } \\frac{1}{4}\n", |
| " \\end{cases}\n", |
| "\\end{aligned}\n", |
| "$$" |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 12, |
| "id": "18eaff11", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "12" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "class MyInit(init.Initializer):\n", |
| " def _init_weight(self, name, data):\n", |
| " print('Init', name, data.shape)\n", |
| " data[:] = nd.random.uniform(low=-10, high=10, shape=data.shape)\n", |
| " data *= data.abs() >= 5\n", |
| "\n", |
| "net.initialize(MyInit(), force_reinit=True)\n", |
| "net[0].weight.data()[0]" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "a3b4db20", |
| "metadata": {}, |
| "source": [ |
| "If this functionality is insufficient, we can even set parameters directly.\n", |
| "Since `data()` returns an `NDArray` we can access it just like any other matrix.\n", |
| "A note for advanced users - if you want to adjust parameters within an\n", |
| "`autograd` scope you need to use `set_data` to avoid confusing the automatic\n", |
| "differentiation mechanics." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 13, |
| "id": "f855993a", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "13" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "net[0].weight.data()[:] += 1\n", |
| "net[0].weight.data()[0,0] = 42\n", |
| "net[0].weight.data()[0]" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "f7d075ef", |
| "metadata": {}, |
| "source": [ |
| "## Tied Parameters\n", |
| "\n", |
| "In some cases, we want to share model parameters across multiple layers. For\n", |
| "instance when we want to find good word embeddings we may decide to use the\n", |
| "same parameters both for encoding and decoding of words. Let's see how to do\n", |
| "this a bit more elegantly. In the following we construct a dense layer and then\n", |
| "use its parameters specifically to set those of another layer." |
| ] |
| }, |
| { |
| "cell_type": "code", |
| "execution_count": 14, |
| "id": "ff71e69a", |
| "metadata": { |
| "attributes": { |
| "classes": [], |
| "id": "", |
| "n": "14" |
| } |
| }, |
| "outputs": [], |
| "source": [ |
| "net = nn.Sequential()\n", |
| "# We need to give the shared layer a name such that we can reference its\n", |
| "# parameters.\n", |
| "shared = nn.Dense(8, activation='relu')\n", |
| "net.add(nn.Dense(8, activation='relu'),\n", |
| " shared,\n", |
| " nn.Dense(8, activation='relu', params=shared.params),\n", |
| " nn.Dense(10))\n", |
| "net.initialize()\n", |
| "\n", |
| "x = nd.random.uniform(shape=(2, 20))\n", |
| "net(x)\n", |
| "\n", |
| "# Check whether the parameters are the same.\n", |
| "print(net[1].weight.data()[0] == net[2].weight.data()[0])\n", |
| "net[1].weight.data()[0,0] = 100\n", |
| "# And make sure that they're actually the same object rather than just having\n", |
| "# the same value.\n", |
| "print(net[1].weight.data()[0] == net[2].weight.data()[0])" |
| ] |
| }, |
| { |
| "cell_type": "markdown", |
| "id": "888a7536", |
| "metadata": {}, |
| "source": [ |
| "The above example shows that the parameters of the second and third layer are\n", |
| "tied. As Python objects, they are identical rather than just being equal.\n", |
| "That is, by changing one of the parameters the other one changes too. What\n", |
| "happens to the gradients is quite ingenious. Since the model parameters contain\n", |
| "gradients, the gradients of the second hidden layer and the third hidden layer\n", |
| "are accumulated in `shared.params.grad` during backpropagation.\n", |
| "\n", |
| "## Conclusion\n", |
| "\n", |
| "In this tutorial you learnt how to initialize a neural network, and should now\n", |
| "understand the difference between deferred and forced initialization. Some more advanced\n", |
| "cases you should now be aware of include custom initialization and tied parameters.\n", |
| "\n", |
| "## Recommended Next Steps\n", |
| "\n", |
| "* Check out the [API Docs](/api/python/docs/api/optimizer/index.html) on initialization for a list of available initialization methods.\n", |
| "* See [this tutorial](/api/python/docs/tutorials/packages/gluon/blocks/naming.html) for more information on Gluon Parameters." |
| ] |
| } |
| ], |
| "metadata": { |
| "language_info": { |
| "name": "python" |
| } |
| }, |
| "nbformat": 4, |
| "nbformat_minor": 5 |
| } |