# Cloud Setup for MXNet
## Set Up an AWS GPU Cluster from Scratch
In this topic, we provide a step-by-step tutorial on how to set up an AWS cluster with
MXNet. We show how to:
- [Use Amazon S3 to host data](#use-amazon-s3-to-host-data)
- [Set up an EC2 GPU instance with all dependencies installed](#set-up-an-ec2-gpu-instance)
- [Build and run MXNet on a single computer](#build-and-run-mxnet-on-a-gpu-instance)
- [Set up an EC2 GPU cluster for distributed training](#set-up-an-ec2-gpu-cluster-for-distributed-training)
### Use Amazon S3 to Host Data
Amazon S3 provides distributed data storage, which is convenient for hosting large datasets. To use S3, you need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html),
which include an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`.
To use MXNet with S3, set the environment variables `AWS_ACCESS_KEY_ID` and
`AWS_SECRET_ACCESS_KEY` by adding the following two lines to
`~/.bashrc` (replace the example values with your own credentials):
```bash
export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
```
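Reload the shell configuration so the variables take effect in the current session, and confirm they are set:
```bash
# pick up the new exports in the current shell
source ~/.bashrc
# both AWS variables should appear in the output
env | grep AWS_
```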
There are several ways to upload data to S3. A simple way is using
[s3cmd](http://s3tools.org/s3cmd). For example:
```bash
wget http://data.mxnet.io/mxnet/data/mnist.zip
unzip mnist.zip && s3cmd put t*-ubyte s3://dmlc/mnist/
```
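You can then list the bucket to verify that the files were uploaded:
```bash
# show the objects now stored under the mnist/ prefix
s3cmd ls s3://dmlc/mnist/
```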
### Set Up an EC2 GPU Instance
MXNet requires the following libraries (a quick way to check for them follows the list):
- C++ compiler with C++11 support, such as `gcc >= 4.8`
- `CUDA` (`CUDNN` is optional) for GPU linear algebra
- `BLAS` (cblas, openblas, atlas, mkl, or others) for CPU linear algebra
- `opencv` for image augmentations
- `curl` and `openssl` for the ability to read/write to Amazon S3
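On a fresh instance, a few read-only commands show which of these are already present (nothing is modified):
```bash
gcc --version      # needs >= 4.8 for C++11 support
nvcc --version     # only present once CUDA is installed
# look for the OpenCV, curl, ATLAS, and cuDNN shared libraries
ldconfig -p | grep -E 'libopencv|libcurl|libatlas|libcudnn'
```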
Installing `CUDA` on EC2 instances requires some effort. Caffe has a good
[tutorial](https://github.com/BVLC/caffe/wiki/Install-Caffe-on-EC2-from-scratch-(Ubuntu,-CUDA-7,-cuDNN-3))
on how to install CUDA 7.0 on Ubuntu 14.04.
***Note:*** We tried CUDA 7.5 on Nov 7,
2015, but found it problematic.
You can install the rest using the package manager. For example, on Ubuntu:
```bash
sudo apt-get update
sudo apt-get install -y build-essential git libcurl4-openssl-dev libatlas-base-dev libopencv-dev python-numpy
```
The Amazon Machine Image [ami-12fd8178](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-12fd8178) has the packages listed above installed.
### Build and Run MXNet on a GPU Instance
The following commands build MXNet with CUDA/CUDNN, Amazon S3, and distributed
training.
```bash
# fetch MXNet with its submodules and start from the default build config
git clone --recursive https://github.com/dmlc/mxnet
cd mxnet; cp make/config.mk .
# enable CUDA/CUDNN, the distributed KVStore, and S3 support
echo "USE_CUDA=1" >>config.mk
echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk
echo "USE_CUDNN=1" >>config.mk
echo "USE_BLAS=atlas" >>config.mk
echo "USE_DIST_KVSTORE=1" >>config.mk
echo "USE_S3=1" >>config.mk
# compile in parallel on all available cores
make -j$(nproc)
```
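A quick check that the build produced the shared library the Python package loads:
```bash
# the build should leave libmxnet.so under lib/
ls -lh lib/libmxnet.so
```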
To test whether everything is installed properly, train a convolutional neural network (CNN) on MNIST using a GPU:
```bash
python tests/python/gpu/test_conv.py
```
If you've placed the MNIST data on `s3://dmlc/mnist`, you can point the test script at S3 instead of the local `data` directory:
```bash
sed -i.bak "s!data_dir = 'data'!data_dir = 's3://dmlc/mnist'!" tests/python/gpu/test_conv.py
```
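Rerun the test after the edit; MXNet now reads the MNIST files directly from Amazon S3:
```bash
python tests/python/gpu/test_conv.py
```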
***Note:*** You can use `sudo ln /dev/null /dev/raw1394` to fix the opencv error `libdc1394 error: Failed to initialize libdc1394`.
### Set Up an EC2 GPU Cluster for Distributed Training
A cluster consists of multiple computers. You can use the computer with MXNet
installed as the root computer for submitting jobs, and then launch several
slave computers to run the jobs. For example, launch multiple instances from an AMI with the dependencies installed, such as
[ami-12fd8178](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-12fd8178).
There are two options for making the slaves reachable from the root:
- Make all the slaves' ports accessible (same for the root) by setting Type: All TCP and
  Source: Anywhere in the Configure Security Group step.
- Use the same `pem` key as the root computer to access all slave computers, and
  copy the `pem` file to the root computer's `~/.ssh/id_rsa`. All slave computers can then be reached over SSH from the root, as sketched after this list.
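A minimal sketch of the second option (the key file name is a placeholder; the IP is one of the slave private IPs from the hosts file below):
```bash
# on the root computer: install the shared key (mykey.pem is a placeholder)
cp mykey.pem ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa
# verify passwordless access to a slave by its private IP
ssh -o StrictHostKeyChecking=no 172.30.0.172 uname -a
```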
Now, run the CNN on multiple computers. Assume that we are in a working
directory on the root computer, such as `~/train`, and that MXNet is built under `~/mxnet`.
1. Pack the MXNet Python library into this working directory for easy
synchronization:
```bash
cp -r ~/mxnet/python/mxnet .
cp ~/mxnet/lib/libmxnet.so mxnet/
```
And then copy the training program:
```bash
cp ~/mxnet/example/image-classification/*.py .
cp -r ~/mxnet/example/image-classification/common .
```
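The working directory should now contain everything the slaves need; a quick check:
```bash
# expect the mxnet/ package (with libmxnet.so inside), the training
# scripts, and the common/ helpers
ls ~/train
```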
2. Prepare a host file with all the slaves' private IPs. For example, `cat hosts`:
```bash
172.30.0.172
172.30.0.171
```
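Before launching, you can confirm that the root computer can reach every slave in the file:
```bash
# run a trivial command on each host listed in hosts
cat hosts | xargs -I{} ssh -o StrictHostKeyChecking=no {} uname -a
```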
3. Assuming that there are two computers, train the CNN using two workers:
```bash
~/mxnet/tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
```
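The distributed KVStore also supports asynchronous updates; switching is just a matter of the `--kv-store` argument:
```bash
# asynchronous SGD instead of synchronous gradient aggregation
~/mxnet/tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_async
```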
***Note:*** Sometimes jobs linger on the slave computers even after you've pressed `Ctrl-C`
at the root node. To terminate them, use the following command:
```bash
cat hosts | xargs -I{} ssh -o StrictHostKeyChecking=no {} 'uname -a; pgrep python | xargs kill -9'
```
***Note:*** The preceding example trains quickly and therefore isn't a good
benchmark for distributed training. Consider using the other [examples](https://github.com/dmlc/mxnet/tree/master/example/image-classification).
### More Options
#### Use Multiple Data Shards
It is common to pack a dataset into multiple files, especially when working in a distributed environment. MXNet supports direct loading from multiple data shards. Put all of the record files into a folder, and point the data path to the folder.
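A minimal sketch of such a layout, assuming two record-file shards (the file and folder names are examples):
```bash
# keep every shard of the dataset in a single folder...
mkdir -p ~/train/mnist-shards
mv mnist-part-0.rec mnist-part-1.rec ~/train/mnist-shards/
# ...and pass the folder itself, rather than a single file, as the
# data path when launching training
```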
#### Use YARN and SGE
Although using SSH can be simple when you don't have a cluster scheduling framework, MXNet is designed to be portable to various platforms. We provide other scripts in [tracker](https://github.com/dmlc/dmlc-core/tree/master/tracker) to allow running on other cluster frameworks, including Hadoop (YARN) and SGE. We welcome your contribution of examples of running MXNet on your favorite distributed platform.