Training a neural network model consists of iteratively performing three simple steps.
The first step is the forward step, which computes the loss. In MXNet Gluon, this is done by running a forward pass with net.forward(X), or simply net(X), and then calling the loss function with the result of the forward pass and the labels. For example, l = loss_fn(net(X), y).
The second step is the backward step which computes the gradient of the loss with respect to the parameters. In Gluon, this step is achieved by doing the first step in an autograd.record() scope to record the computations needed to calculate the loss, and then calling l.backward() to compute the gradient of the loss with respect to the parameters.
The final step is to update the neural network model parameters using an optimization algorithm. In Gluon, this step is performed by the gluon.Trainer and is the subject of this guide. When creating a Gluon Trainer you must provide a collection of parameters that need to be learnt. You also provide an Optimizer that will be used to update the parameters every training iteration when trainer.step is called.
To illustrate how to use the Gluon Trainer, we will create a simple perceptron model and then create a Trainer instance from the perceptron's parameters and a simple optimizer: SGD with a learning rate of 1.
from mxnet import np, autograd, optimizer, gluon

net = gluon.nn.Dense(1)
net.initialize()

trainer = gluon.Trainer(net.collect_params(),
                        optimizer='sgd', optimizer_params={'learning_rate': 1})
Before we can use the trainer to update model parameters, we must first run the forward and backward passes. Here we implement a function to compute the first two steps (forward step and backward step) of training the perceptron on a random dataset.
batch_size = 8
X = np.random.uniform(size=(batch_size, 4))
y = np.random.uniform(size=(batch_size,))

loss = gluon.loss.L2Loss()

def forward_backward():
    # Record the forward pass so that gradients can be computed afterwards.
    with autograd.record():
        l = loss(net(X), y)
    l.backward()

forward_backward()
Warning: It is extremely important that the gradients of the loss with respect to your model parameters are computed before calling trainer.step. A common way to introduce bugs into your training code is to omit the l.backward() call before the update step.
Before updating, let's check the current network parameters.
curr_weight = net.weight.data().copy()
print(curr_weight)
Trainer step

Now we will call the step method to perform one update. We provide the batch_size as an argument to normalize the size of the gradients and make it independent of the batch size. Otherwise we'd get larger gradients with larger batch sizes. We can see the network parameters have now changed.
trainer.step(batch_size)
print(net.weight.data())
Since we used plain SGD, the update rule is $w = w - \eta/b \nabla \ell$, where $b$ is the batch size and $\nabla\ell$ is the gradient of the loss function with respect to the weights and $\eta$ is the learning rate.
We can verify it by running the following code snippet which is explicitly performing the SGD update.
print(curr_weight - net.weight.grad() * 1 / batch_size)
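The update rule above can also be traced by hand without MXNet. The following is a minimal sketch in plain Python, assuming the per-sample loss is $\frac{1}{2}(\hat{y}_i - y_i)^2$ and that the backward pass sums the per-sample gradients over the batch. The helpers `summed_gradient` and `sgd_step` are hypothetical names introduced for illustration only, and the bias term is left out for brevity. The sketch also shows why the gradient is divided by the batch size: duplicating every sample doubles the summed gradient, but the normalized update stays the same.

```python
import random

random.seed(0)

batch_size, num_features = 8, 4
lr = 1.0  # learning rate (eta), matching the example above

# Random inputs, labels, and initial weights; the bias is omitted for brevity.
X = [[random.random() for _ in range(num_features)] for _ in range(batch_size)]
y = [random.random() for _ in range(batch_size)]
w = [random.uniform(-0.1, 0.1) for _ in range(num_features)]


def summed_gradient(w, X, y):
    """Gradient of sum_i 0.5 * (w . x_i - y_i)^2 with respect to w (assumed loss)."""
    grad = [0.0] * len(w)
    for x_i, y_i in zip(X, y):
        pred = sum(w_j * x_j for w_j, x_j in zip(w, x_i))  # forward pass
        err = pred - y_i
        for j, x_j in enumerate(x_i):
            grad[j] += err * x_j  # backward pass, accumulated over the batch
    return grad


def sgd_step(w, grad, lr, batch_size):
    """Apply w <- w - (lr / batch_size) * grad."""
    return [w_j - lr / batch_size * g_j for w_j, g_j in zip(w, grad)]


new_w = sgd_step(w, summed_gradient(w, X, y), lr, batch_size)

# Duplicating every sample doubles the summed gradient, but dividing by the
# doubled batch size yields the same update.
new_w_doubled = sgd_step(w, summed_gradient(w, X + X, y + y), lr, 2 * batch_size)

print(new_w)
print(new_w_doubled)
```

The two printed weight vectors agree up to floating-point rounding, which is the batch-size independence that passing batch_size to trainer.step is meant to provide.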
In the previous example, we used the string argument 'sgd' to select the optimization method, and optimizer_params to specify the arguments of that method.
All pre-defined optimization methods can be passed in this way and the complete list of implemented optimizers is provided in the mxnet.optimizer module.
However we can also pass an optimizer instance directly to the Trainer constructor.
For example:
optim = optimizer.Adam(learning_rate=1)
trainer = gluon.Trainer(net.collect_params(), optim)

forward_backward()
trainer.step(batch_size)
net.weight.data()
For reference and implementation details about each optimizer, please refer to the guide and API doc for the optimizer module.
The Trainer constructor also accepts the following keyword arguments:
kvstore – how key value store should be created for multi-gpu and distributed training. Check out mxnet.kvstore.KVStore for more information. String options are any of the following ['local', 'device', 'dist_device_sync', 'dist_device_async'].

compression_params – Specifies type of gradient compression and additional arguments depending on the type of compression being used. See mxnet.KVStore.set_gradient_compression_method for more details on gradient compression.

update_on_kvstore – Whether to perform parameter updates on KVStore. If None, then the Trainer instance will choose the more suitable option depending on the type of KVStore.

We set the initial learning rate when creating a trainer by passing the learning rate as an optimizer_param. However, sometimes we may need to change the learning rate during training, for example when doing an explicit learning rate warmup schedule. The trainer instance provides an easy way to achieve this.
The current learning rate can be accessed through the learning_rate attribute.
trainer.learning_rate
We can change it through the set_learning_rate method.
trainer.set_learning_rate(0.1)
trainer.learning_rate
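As a rough illustration of the explicit warmup mentioned earlier, the sketch below computes a linear warmup in plain Python. The helper `warmup_lr` and the values of `base_lr` and `warmup_iters` are assumptions made for this example and are not part of the Gluon API. The trailing comments indicate where set_learning_rate would be called in a training loop.

```python
def warmup_lr(iteration, base_lr, warmup_iters):
    """Linearly ramp the learning rate from base_lr / warmup_iters up to base_lr."""
    if iteration >= warmup_iters:
        return base_lr
    return base_lr * (iteration + 1) / warmup_iters


base_lr, warmup_iters = 0.1, 5  # illustrative values, not recommendations
schedule = [warmup_lr(i, base_lr, warmup_iters) for i in range(8)]
print(schedule)

# Inside a training loop, each value would be applied before the update, e.g.:
#   trainer.set_learning_rate(warmup_lr(i, base_lr, warmup_iters))
#   forward_backward()
#   trainer.step(batch_size)
```

The learning rate rises over the first five iterations and then stays at the base value.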
In addition, there are multiple pre-defined learning rate scheduling methods that are already implemented in the mxnet.lr_scheduler module. The learning rate schedulers can be incorporated into your trainer by passing them in as an optimizer_param entry. Please refer to the LR scheduler guide to learn more.
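Conceptually, a learning rate scheduler is a callable that maps the number of updates performed so far to a learning rate. The sketch below is a hypothetical plain-Python stand-in (`StepDecay` is not an MXNet class) that multiplies the rate by a fixed factor every `step` updates. The schedulers in mxnet.lr_scheduler may differ in details such as the exact update at which the decay takes effect.

```python
class StepDecay:
    """Hypothetical stand-in for a scheduler: maps an update count to a learning rate."""

    def __init__(self, base_lr, step, factor):
        self.base_lr = base_lr
        self.step = step
        self.factor = factor

    def __call__(self, num_update):
        # Decay the base rate once for every `step` completed updates.
        return self.base_lr * self.factor ** (num_update // self.step)


schedule = StepDecay(base_lr=1.0, step=10, factor=0.5)
rates = [schedule(n) for n in (0, 9, 10, 25)]
print(rates)

# With MXNet, a scheduler from mxnet.lr_scheduler would be handed to the trainer
# through optimizer_params, e.g.:
#   optimizer_params={'learning_rate': 1.0, 'lr_scheduler': scheduler}
```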
The Trainer API is used to update the parameters of a network with a particular optimization algorithm.
After the forward and backward passes, the model update step is performed by calling trainer.step().
A Trainer can be instantiated by passing in the name of the optimizer to use and the optimizer_params for that optimizer, or alternatively by passing in an instance of mxnet.optimizer.Optimizer.
The learning rate of a Trainer can be changed with set_learning_rate, and Gluon also provides a module for learning rate scheduling.

While optimization and optimizers play a significant role in deep learning model training, there are still other important components to model training. Here are a few suggestions about where to look next.