# Run MXNet on Multiple CPU/GPUs with Data Parallelism

MXNet supports training with multiple CPUs and GPUs, which may be located on different physical machines.

## Data Parallelism vs Model Parallelism

By default, MXNet uses data parallelism to partition the workload over multiple
devices. Assume there are *n* devices; each one then receives a complete copy of the model
and trains it on *1/n* of the data. The results, such as gradients and
updated weights, are communicated across these devices.

Model parallelism is also supported. With model parallelism, each device maintains a part of the model. It is useful when the model is too large to fit on a single device. There is [a tutorial](./model_parallel_lstm.md) showing how to do model parallelism for a multi-layer LSTM model. This tutorial focuses on data parallelism.

## Multiple GPUs within a Single Machine

### Workload Partitioning

By default, MXNet partitions a data batch evenly across the GPUs. Assume a batch size *b* and *k* GPUs; in one iteration
each GPU then performs forward and backward passes on *b/k* examples. The
gradients are summed over all GPUs before the model is updated.
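
This partitioning is handled inside MXNet. As a rough, hedged illustration only, the sketch below splits a hypothetical batch of *b = 8* examples across *k = 2* GPUs by hand; the array contents and contexts are made up for illustration:

```python
import mxnet as mx
import numpy as np

# Hypothetical batch of b = 8 examples with 4 features each; k = 2 GPUs.
batch = np.arange(8 * 4).reshape(8, 4).astype('float32')
ctx_list = [mx.gpu(0), mx.gpu(1)]

k = len(ctx_list)
shard = batch.shape[0] // k  # b/k examples per GPU

# Copy each shard to its device; forward and backward would then run on each
# shard in parallel, and the resulting gradients would be summed across GPUs.
shards = [mx.nd.array(batch[i * shard:(i + 1) * shard], ctx=ctx)
          for i, ctx in enumerate(ctx_list)]
```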

### How to Use

> To use GPUs, we need to compile MXNet with GPU support. For
> example, set `USE_CUDA=1` in `config.mk` before running `make`. (See the
> [MXNet installation guide](http://mxnet.io/get_started/setup.html) for more options.)

If a machine has one or more GPU cards installed, each card is
labeled with a number starting from 0. To use a particular GPU, one can
either specify the context `ctx` in code or pass `--gpus` on the command line. For
example, to use GPUs 0 and 2 in Python, one can create a model with
```python
import mxnet as mx
model = mx.model.FeedForward(ctx=[mx.gpu(0), mx.gpu(2)], ...)
```
Alternatively, if the program accepts a `--gpus` flag, such as
[example/image-classification](https://github.com/dmlc/mxnet/tree/master/example/image-classification),
then we can try
```bash
python train_mnist.py --gpus 0,2 ...
```

### Advanced Usage

If the GPUs have different computation power, we can partition the workload
accordingly. For example, if GPU 0 is 3 times faster than GPU 2,
we can provide the additional workload option `work_load_list=[3, 1]`; see
[model.fit](../api/python/model.html#mxnet.model.FeedForward.fit) for more
details.
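
For example, building on the earlier snippet, the call might look like the following minimal sketch, where `net` and `train_iter` are hypothetical placeholders for a network symbol and a data iterator:

```python
import mxnet as mx

# `net` is a network symbol and `train_iter` a data iterator defined elsewhere;
# both are placeholders for this illustration.
model = mx.model.FeedForward(ctx=[mx.gpu(0), mx.gpu(2)], symbol=net, num_epoch=10)

# GPU 0 is assumed to be 3x faster than GPU 2, so it receives 3/4 of each batch.
model.fit(X=train_iter, work_load_list=[3, 1])
```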

Training with multiple GPUs should produce the same results as a single GPU if all
other hyper-parameters are the same. In practice, the results may vary slightly, mainly due
to the randomness of I/O (random data order and other augmentations), weight
initialization with different seeds, and cuDNN.

We can control where gradients are aggregated and where model updates are performed
by creating different `KVStore` instances; `KVStore` is the module for data
communication. One can either use `mx.kvstore.create(type)` to get an instance or use the program flag `--kv-store type`.

There are two commonly used types:

- `local`: all gradients are copied to CPU memory and weights are updated there.
- `device`: both gradient aggregation and weight updates are run on the GPUs. It also attempts to use GPU peer-to-peer communication, which can accelerate communication. But this option may result in higher GPU memory usage.

When there is a large number of GPUs, e.g. >= 4, we suggest using `device` for better performance.
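
For instance, one way to select the `device` type explicitly from Python is sketched below; `model` and `train_iter` are assumed to exist as in the earlier snippets:

```python
import mxnet as mx

# Create a device-type KVStore so gradient aggregation and weight updates run on the GPUs.
kv = mx.kvstore.create('device')

# Pass it to fit; alternatively, many example scripts accept `--kv-store device`.
model.fit(X=train_iter, kvstore=kv)
```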

## Distributed Training with Multiple Machines

We can simply change the `KVStore` type to run with multiple machines.

- `dist_sync` behaves similarly to `local`. One major difference is that
  `batch-size` now means the batch size used on each machine. So if there are *n*
  machines and we use batch size *b*, then `dist_sync` behaves like `local` with batch size *n\*b*.
- `dist_device_sync` is identical to `dist_sync`, with a difference analogous to that between `device` and `local`.
- `dist_async` performs asynchronous updates. The weights are
  updated as soon as a gradient is received from any machine. The update is atomic,
  namely, no two updates happen to the same weight at the same time. However,
  the order is not guaranteed.
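
As a hedged sketch of how a training script can inspect this setup, the snippet below creates a `dist_sync` store and computes the effective global batch size; the per-machine batch size of 128 is hypothetical, and this requires MXNet built with `USE_DIST_KVSTORE=1` and started through a launcher as described below:

```python
import mxnet as mx

# Requires MXNet compiled with USE_DIST_KVSTORE=1 and a distributed launcher (see below).
kv = mx.kvstore.create('dist_sync')

per_machine_batch = 128                            # hypothetical per-machine batch size
global_batch = per_machine_batch * kv.num_workers  # dist_sync behaves like `local` at this size

print('worker %d of %d, effective global batch size %d'
      % (kv.rank, kv.num_workers, global_batch))
```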

### How to Launch a Job

> To use distributed training, we need to compile with `USE_DIST_KVSTORE=1`
> (see [MXNet installation guide](http://mxnet.io/get_started/setup.html) for more options).

Launching a distributed job is a bit different from running on a single
machine. MXNet provides
[tools/launch.py](https://github.com/dmlc/mxnet/blob/master/tools/launch.py) to
start a job by using `ssh`, `mpi`, `sge`, or `yarn`.

Assume we are in the directory `mxnet/example/image-classification` and want
to train MNIST with the LeNet network using
[train_mnist.py](https://github.com/dmlc/mxnet/blob/master/example/image-classification/train_mnist.py).
On a single machine, we can run:

```bash
python train_mnist.py --network lenet
```

Now suppose there are two machines reachable over ssh, and we want to train on both.
First, we save the IPs (or hostnames) of these two machines in a file `hosts`, e.g.

```bash
$ cat hosts
172.30.0.172
172.30.0.171
```

Next, if the mxnet folder is accessible by both machines, e.g. on a
[network filesystem](https://help.ubuntu.com/lts/serverguide/network-file-system.html),
then we can run:

```bash
python ../../tools/launch.py -n 2 --launcher ssh -H hosts python train_mnist.py --network lenet --kv-store dist_sync
```

Note that, besides the single-machine arguments, here we:

- use `launch.py` to submit the job
- provide a launcher: `ssh` if all machines are reachable by ssh, `mpi` if `mpirun` is
  available, `sge` for Sun Grid Engine, and `yarn` for Apache Yarn
- `-n` specifies the number of worker nodes to run
- `-H` specifies the host file, which is required by `ssh` and `mpi`
- `--kv-store` uses either `dist_sync` or `dist_async`

### Synchronize Directory

Now consider the case where the mxnet folder is not accessible from the other machines. We can first copy the MXNet
library into the current directory:
```bash
cp -r ../../python/mxnet .
cp -r ../../lib/libmxnet.so mxnet
```

and then ask `launch.py` to synchronize the current directory to the `/tmp/mxnet`
directory on all machines with `--sync-dst-dir`:

```bash
python ../../tools/launch.py -n 2 -H hosts --sync-dst-dir /tmp/mxnet \
   python train_mnist.py --network lenet --kv-store dist_sync
```

### Use a Particular Network Interface

MXNet often chooses the first available network interface. But for machines with
multiple interfaces, we can specify which network interface to use for data
communication with the environment variable `DMLC_INTERFACE`. For example, to use
the interface `eth0`, we can run

```bash
export DMLC_INTERFACE=eth0; python ../../tools/launch.py ...
```

### Debug Connection

Set `PS_VERBOSE=1` to see debug logging, e.g.
```bash
export PS_VERBOSE=1; python ../../tools/launch.py ...
```

### More

- See more launch options by running `python ../../tools/launch.py -h`
- See more options of [ps-lite](http://ps-lite.readthedocs.org/en/latest/how_to.html)