In this tutorial, we assign labels to an image with confidence scores. The following figure (source) shows an example:
Get the source code for the tutorial from GitHub.
To train models on a particular dataset, use the corresponding `train_<dataset>.py` script. For example:
python train_mnist.py  # train with the default network and hyper-parameters
mkdir model; python train_mnist.py --model-prefix model/mnist  # save a checkpoint after every epoch under model/
python train_mnist.py --model-prefix model/mnist --load-epoch 8  # resume training from the checkpoint saved at epoch 8
python train_mnist.py --lr .1 --lr-factor .9 --lr-factor-epoch .5  # start with learning rate 0.1 and decay it by 0.9 every half epoch
python train_mnist.py --network lenet --gpus 0  # train the lenet network on GPU 0
To use multiple GPUs, specify the list of devices; for example: --gpus 0,1,3.
To see more options, use --help.
To speed up training, train the model using multiple computers.
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
You can use either synchronous SGD dist_sync or asynchronous SGD dist_async.
If multiple computers are available, put their IP addresses in a file, for example:

    $ cat hosts
    172.30.0.172
    172.30.0.171

Then pass this file with -H:
../../tools/launch.py -n 2 -H hosts python train_mnist.py --kv-store dist_sync
If MXNet is not available on the other computers, first copy the MXNet library into this example folder:

    cp -r ../../python/mxnet .
    cp -r ../../lib/libmxnet.so mxnet
Then synchronize this folder to /tmp/mxnet on the other computers before running:
../../tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
For more launch options (for example, using YARN) and for information about how to write a distributed training program, see this tutorial.
You have several options for generating predictions.
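One option is to load a checkpoint that training saved with --model-prefix and run it on new data. The following is a minimal sketch using the FeedForward API; the checkpoint prefix model/mnist, the epoch number 10, and the random toy input are placeholders for illustration, not values from this tutorial:

    import numpy as np
    import mxnet as mx

    # load a checkpoint saved during training, e.g., with --model-prefix model/mnist
    # (the prefix and epoch number are placeholders)
    model = mx.model.FeedForward.load('model/mnist', epoch=10, ctx=mx.cpu())

    # any data iterator works; here, a toy batch of flattened 28 x 28 images
    data = np.random.rand(100, 784).astype('float32')
    data_iter = mx.io.NDArrayIter(data, batch_size=100)

    probs = model.predict(data_iter)   # class probabilities, one row per example
    labels = probs.argmax(axis=1)      # predicted label = index of the most probable class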
There are two ways to feed data into MXNet:
Pack all examples into one or more compact RecordIO files. For more information, see this step-by-step tutorial and the documentation; a short ImageRecordIter sketch also follows the in-memory example below. Avoid the common mistake of neglecting to shuffle the image list during packing; this causes training to fail. For example, the accuracy stays at 0.001 for several rounds.
Note: We automatically download the small datasets, such as mnist and cifar10.
For small datasets, which can be easily loaded into memory, here is an example:
    import mxnet as mx
    from sklearn.datasets import fetch_mldata
    from sklearn.utils import shuffle

    mnist = fetch_mldata('MNIST original', data_home="./mnist")

    # shuffle data
    X, y = shuffle(mnist.data, mnist.target)

    # split the dataset
    train_data = X[:50000, :].astype('float32')
    train_label = y[:50000]
    val_data = X[50000:60000, :].astype('float32')
    val_label = y[50000:60000]

    # normalize data
    train_data[:] /= 256.0
    val_data[:] /= 256.0

    # create data iterators from the NumPy arrays
    batch_size = 100
    train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size=batch_size, shuffle=True)
    val_iter = mx.io.NDArrayIter(val_data, val_label, batch_size=batch_size)

    # create and train the model as usual
    model = mx.model.FeedForward(...)
    model.fit(X=train_iter, eval_data=val_iter)
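For the RecordIO path described above, the packed .rec file is read with mx.io.ImageRecordIter instead of NDArrayIter. Here is a minimal sketch; the file name, image shape, and batch size are placeholder values:

    import mxnet as mx

    # read a packed RecordIO file; the file name and shapes below are placeholders
    train_iter = mx.io.ImageRecordIter(
        path_imgrec="data/train.rec",   # the .rec file produced by the packing tool
        data_shape=(3, 28, 28),         # channels, height, width of the decoded images
        batch_size=128,
        shuffle=True,                   # reshuffle the examples at each epoch
        rand_crop=True,                 # simple data augmentation
        rand_mirror=True,
        preprocess_threads=4)           # CPU threads used for decoding

    # the iterator plugs into model.fit exactly like the NDArrayIter above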
The following factors can significantly improve performance:
A fast back end. A fast BLAS library (e.g., OpenBLAS, ATLAS, or MKL) is necessary only if you are training on the CPU. For NVIDIA GPUs, we strongly recommend using cuDNN.
Input data:
Data format. Use the RecordIO (.rec) format.
The number of threads used for decoding. By default, MXNet uses four CPU threads for decoding images, which can often decode more than 1,000 images per second. If you are using a low-end CPU or very powerful GPUs, you can increase the number of threads.
Data storage location. Any local or distributed file system (HDFS, Amazon S3) should be fine. However, if multiple computers read the data from a shared network file system (NFS) at the same time, you might encounter problems.
Batch size. We recommend using the largest size that the GPU memory can accommodate. A value that is too large might slow down convergence. A safe batch size for CIFAR 10 is approximately 200; for ImageNet 1K, the batch size can exceed 1,000.
If you are using more than one GPU, the right kvstore. For more information, see this guide.
For a single computer, the default local is often sufficient. For models larger than 100 MB, such as AlexNet and VGG, you might want to use local_allreduce_device, which uses more GPU memory than the other options (a short sketch follows this list).
For multiple computers, we recommend trying to use dist_sync first. If the model is very large or if you use a large number of computers, you might want to use dist_async.
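The kvstore choice above is what the --kv-store flag of the training scripts selects; if you write your own training code with the FeedForward API, it is passed to fit directly. A minimal sketch, assuming train_iter and val_iter were created as in the data examples above and using a tiny placeholder network:

    import mxnet as mx

    # tiny placeholder network; replace with your real model
    data = mx.symbol.Variable('data')
    fc = mx.symbol.FullyConnected(data=data, num_hidden=10)
    net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')

    # train on four GPUs of a single machine
    model = mx.model.FeedForward(symbol=net, ctx=[mx.gpu(i) for i in range(4)],
                                 num_epoch=10)

    # 'local_allreduce_device' trades extra GPU memory for faster gradient
    # aggregation on one machine; for multiple machines, use 'dist_sync' or
    # 'dist_async' together with launch.py
    model.fit(X=train_iter, eval_data=val_iter, kvstore='local_allreduce_device')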
Computers
| name | hardware | software |
| --- | --- | --- |
| GTX980 | Xeon E5-1650 v3, 4 x GTX 980 | GCC 4.8, CUDA 7.5, CUDNN 3 |
| TitanX | dual Xeon E5-2630 v3, 4 x GTX Titan X | GCC 4.8, CUDA 7.5, CUDNN 3 |
| EC2-g2.8x | Xeon E5-2670, 2 x GRID K520, 10G Ethernet | GCC 4.8, CUDA 7.5, CUDNN 3 |
Datasets
| name | classes | image size | training | testing |
| ---- | ------: | ---------: | -------: | ------: |
| CIFAR 10 | 10 | 28 × 28 × 3 | 60,000 | 10,000 |
| ILSVRC 12 | 1,000 | 227 × 227 × 3 | 1,281,167 | 50,000 |
python train_cifar10.py --batch-size 128 --lr 0.1 --lr-factor .94 --num-epoch 50
Performance:
| 1 GTX 980 | 2 GTX 980 | 4 GTX 980 |
| --- | --- | --- |
| 842 img/sec | 1640 img/sec | 2943 img/sec |
Accuracy vs epoch (interactive figure):
train_imagenet.py with --network vgg
Performance
| Cluster | # machines | # GPUs | batch size | kvstore | epoch time (sec) |
|---|---|---|---|---|---|
| TitanX | 1 | 1 | 96 | none | 14,545 |
| - | - | 2 | - | local | 19,692 |
| - | - | 4 | - | - | 20,014 |
| - | - | 2 | - | local_allreduce_device | 9,142 |
| - | - | 4 | - | - | 8,533 |
| - | - | - | 384 | - | 5,161 |
train_imagenet.py with --network inception-bn
Performance
| Cluster | # machines | # GPUs | batch size | kvstore | epoch time (sec) |
| --- | --- | --- | --- | --- | ---: |
| GTX980 | 1 | 1 | 32 | `local` | 13,210 |
| - | - | 2 | 64 | - | 7,198 |
| - | - | 3 | 128 | - | 4,952 |
| - | - | 4 | - | - | 3,589 |
| TitanX | 1 | 1 | 128 | `none` | 10,666 |
| - | - | 2 | - | `local` | 5,161 |
| - | - | 3 | - | - | 3,460 |
| - | - | 4 | - | - | 2,844 |
| - | - | - | 512 | - | 2,495 |
| EC2-g2.8x | 1 | 4 | 144 | `local` | 14,203 |
| - | 10 | 40 | 144 | `dist_sync` | 1,422 |
Convergence
single machine:

    python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
        --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
        --data-dir ilsvrc12/ --model-prefix model/ilsvrc12
10 x g2.8x, where hosts contains the private IPs of the 10 computers:

    ../../tools/launch.py -H hosts -n 10 --sync-dir /tmp/mxnet \
        python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
        --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
        --kv-store dist_sync \
        --data-dir s3://dmlc/ilsvrc12/ --model-prefix s3://dmlc/model/ilsvrc12
Note: Occasional instability in Amazon S3 might cause training to hang or generate frequent errors; to avoid this, you can download the data to /mnt first.
Accuracy vs. epoch (interactive figure):