In this tutorial, we assign labels to an image with confidence scores. The following figure (source) shows an example:
Get the source code for the tutorial from GitHub.
To train models on a particular dataset, use the corresponding `train_<dataset>.py` script. For example:
python train_mnist.py  # train with the default network and hyper-parameters
mkdir model; python train_mnist.py --model-prefix model/mnist  # save a checkpoint after every epoch under model/
python train_mnist.py --model-prefix model/mnist --load-epoch 8  # resume training from the checkpoint saved at epoch 8
python train_mnist.py --lr .1 --lr-factor .9 --lr-factor-epoch .5  # start with learning rate 0.1 and decay it by 0.9 every half epoch
python train_mnist.py --network lenet --gpus 0  # train the lenet network on GPU 0
To use multiple GPUs, specify the list of devices; for example: --gpus 0,1,3.
To see more options, use --help.
To speed up training, train the model using multiple computers.
../../tools/launch.py -n 2 python train_mnist.py --kv-store dist_sync
You can use either synchronous SGD dist_sync or asynchronous SGD dist_async.
If multiple computers are available, put their IP addresses in a file, for example:

    $ cat hosts
    172.30.0.172
    172.30.0.171

Then pass this file with -H:
../../tools/launch.py -n 2 -H hosts python train_mnist.py --kv-store dist_sync
If MXNet is not available on the other computers, first copy the MXNet library into this example folder:

    cp -r ../../python/mxnet .
    cp -r ../../lib/libmxnet.so mxnet
Then synchronize this folder to /tmp/mxnet on the other computers before running:
../../tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
For more launch options (for example, using YARN) and for information about how to write a distributed training program, see this tutorial.
You have several options for generating predictions.
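One option is to load a checkpoint that training saved with --model-prefix and run it on new data. The following is a minimal sketch using the FeedForward API; the checkpoint prefix model/mnist, the epoch number 10, and the random toy input are placeholders for illustration, not values from this tutorial:

    import numpy as np
    import mxnet as mx

    # load a checkpoint saved during training, e.g., with --model-prefix model/mnist
    # (the prefix and epoch number are placeholders)
    model = mx.model.FeedForward.load('model/mnist', epoch=10, ctx=mx.cpu())

    # any data iterator works; here, a toy batch of flattened 28 x 28 images
    data = np.random.rand(100, 784).astype('float32')
    data_iter = mx.io.NDArrayIter(data, batch_size=100)

    probs = model.predict(data_iter)   # class probabilities, one row per example
    labels = probs.argmax(axis=1)      # predicted label = index of the most probable class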
There are two ways to feed data into MXNet:
Pack all examples into one or more compact RecordIO files. For more information, see this step-by-step tutorial and the documentation; a short ImageRecordIter sketch also follows the in-memory example below. Avoid the common mistake of neglecting to shuffle the image list during packing; this causes training to fail. For example, the accuracy stays at 0.001 for several rounds.
Note: We automatically download the small datasets, such as mnist and cifar10.
For small datasets, which can be easily loaded into memory, here is an example:
    import mxnet as mx
    from sklearn.datasets import fetch_mldata
    from sklearn.utils import shuffle

    mnist = fetch_mldata('MNIST original', data_home="./mnist")

    # shuffle data
    X, y = shuffle(mnist.data, mnist.target)

    # split the dataset
    train_data = X[:50000, :].astype('float32')
    train_label = y[:50000]
    val_data = X[50000:60000, :].astype('float32')
    val_label = y[50000:60000]

    # normalize data
    train_data[:] /= 256.0
    val_data[:] /= 256.0

    # create data iterators from the NumPy arrays
    batch_size = 100
    train_iter = mx.io.NDArrayIter(train_data, train_label, batch_size=batch_size, shuffle=True)
    val_iter = mx.io.NDArrayIter(val_data, val_label, batch_size=batch_size)

    # create and train the model as usual
    model = mx.model.FeedForward(...)
    model.fit(X=train_iter, eval_data=val_iter)
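For the RecordIO path described above, the packed .rec file is read with mx.io.ImageRecordIter instead of NDArrayIter. Here is a minimal sketch; the file name, image shape, and batch size are placeholder values:

    import mxnet as mx

    # read a packed RecordIO file; the file name and shapes below are placeholders
    train_iter = mx.io.ImageRecordIter(
        path_imgrec="data/train.rec",   # the .rec file produced by the packing tool
        data_shape=(3, 28, 28),         # channels, height, width of the decoded images
        batch_size=128,
        shuffle=True,                   # reshuffle the examples at each epoch
        rand_crop=True,                 # simple data augmentation
        rand_mirror=True,
        preprocess_threads=4)           # CPU threads used for decoding

    # the iterator plugs into model.fit exactly like the NDArrayIter above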
The following factors can significantly improve performance:
A fast back end. A fast BLAS library (e.g., OpenBLAS, ATLAS, or MKL) is necessary only if you are training on the CPU. For NVIDIA GPUs, we strongly recommend using cuDNN.
Input data:
Data format. Use the RecordIO (.rec) format.
The number of threads used for decoding. By default, MXNet uses four CPU threads for decoding images, which can often decode more than 1,000 images per second. If you are using a low-end CPU or very powerful GPUs, you can increase the number of threads.
Data storage location. Any local or distributed file system (HDFS, Amazon S3) should be fine. However, if multiple computers read the data from a shared network file system (NFS) at the same time, you might encounter problems.
Batch size. We recommend using the largest size that the GPU memory can accommodate. A value that is too large might slow down convergence. A safe batch size for CIFAR 10 is approximately 200; for ImageNet 1K, the batch size can exceed 1,000.
If you are using more than one GPU, the right kvstore. For more information, see this guide.
For a single computer, the default local is often sufficient. For models larger than 100 MB, such as AlexNet and VGG, you might want to use local_allreduce_device, which uses more GPU memory than the other options (a short sketch follows this list).
For multiple computers, we recommend trying to use dist_sync first. If the model is very large or if you use a large number of computers, you might want to use dist_async.
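The kvstore choice above is what the --kv-store flag of the training scripts selects; if you write your own training code with the FeedForward API, it is passed to fit directly. A minimal sketch, assuming train_iter and val_iter were created as in the data examples above and using a tiny placeholder network:

    import mxnet as mx

    # tiny placeholder network; replace with your real model
    data = mx.symbol.Variable('data')
    fc = mx.symbol.FullyConnected(data=data, num_hidden=10)
    net = mx.symbol.SoftmaxOutput(data=fc, name='softmax')

    # train on four GPUs of a single machine
    model = mx.model.FeedForward(symbol=net, ctx=[mx.gpu(i) for i in range(4)],
                                 num_epoch=10)

    # 'local_allreduce_device' trades extra GPU memory for faster gradient
    # aggregation on one machine; for multiple machines, use 'dist_sync' or
    # 'dist_async' together with launch.py
    model.fit(X=train_iter, eval_data=val_iter, kvstore='local_allreduce_device')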
Computers
| name | hardware | software |
| --- | --- | --- |
| GTX980 | Xeon E5-1650 v3, 4 x GTX 980 | GCC 4.8, CUDA 7.5, CUDNN 3 |
| TitanX | dual Xeon E5-2630 v3, 4 x GTX Titan X | GCC 4.8, CUDA 7.5, CUDNN 3 |
| EC2-g2.8x | Xeon E5-2670, 2 x GRID K520, 10G Ethernet | GCC 4.8, CUDA 7.5, CUDNN 3 |
Datasets
| name | classes | image size | training | testing |
| ---- | ------: | ---------: | -------: | ------: |
| CIFAR 10 | 10 | 28 × 28 × 3 | 60,000 | 10,000 |
| ILSVRC 12 | 1,000 | 227 × 227 × 3 | 1,281,167 | 50,000 |
python train_cifar10.py --batch-size 128 --lr 0.1 --lr-factor .94 --num-epoch 50
Performance:
| 1 GTX 980 | 2 GTX 980 | 4 GTX 980 |
| --- | --- | --- |
| 842 img/sec | 1640 img/sec | 2943 img/sec |
Accuracy vs epoch (interactive figure):
train_imagenet.py with --network vgg
Performance
| Cluster | # machines | # GPUs | batch size | kvstore | epoch time (sec) |
|---|---|---|---|---|---|
| TitanX | 1 | 1 | 96 | none | 14,545 |
| - | - | 2 | - | local | 19,692 |
| - | - | 4 | - | - | 20,014 |
| - | - | 2 | - | local_allreduce_device | 9,142 |
| - | - | 4 | - | - | 8,533 |
| - | - | - | 384 | - | 5,161 |
train_imagenet.py with --network inception-bn
Performance
| Cluster | # machines | # GPUs | batch size | kvstore | epoch time (sec) |
| --- | --- | --- | --- | --- | ---: |
| GTX980 | 1 | 1 | 32 | `local` | 13,210 |
| - | - | 2 | 64 | - | 7,198 |
| - | - | 3 | 128 | - | 4,952 |
| - | - | 4 | - | - | 3,589 |
| TitanX | 1 | 1 | 128 | `none` | 10,666 |
| - | - | 2 | - | `local` | 5,161 |
| - | - | 3 | - | - | 3,460 |
| - | - | 4 | - | - | 2,844 |
| - | - | - | 512 | - | 2,495 |
| EC2-g2.8x | 1 | 4 | 144 | `local` | 14,203 |
| - | 10 | 40 | 144 | `dist_sync` | 1,422 |
Convergence
single machine:

    python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
        --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
        --data-dir ilsvrc12/ --model-prefix model/ilsvrc12
10 x g2.8x, where hosts contains the private IPs of the 10 computers:

    ../../tools/launch.py -H hosts -n 10 --sync-dir /tmp/mxnet \
        python train_imagenet.py --batch-size 144 --lr 0.05 --lr-factor .94 \
        --gpus 0,1,2,3 --num-epoch 60 --network inception-bn \
        --kv-store dist_sync \
        --data-dir s3://dmlc/ilsvrc12/ --model-prefix s3://dmlc/model/ilsvrc12
Note: Occasional instability in Amazon S3 might cause training to hang or generate frequent errors; to avoid this, you can download the data to /mnt first.
Accuracy vs. epoch (interactive figure):