| # MXNet on the Cloud | |
Deep learning can require extremely powerful hardware, often for unpredictable lengths of time.
| Moreover, _MXNet_ can benefit from both multiple GPUs and multiple machines. | |
| Accordingly, cloud computing, as offered by AWS and others, | |
| is especially well suited to training deep learning models. | |
| Using AWS, we can rapidly fire up multiple machines with multiple GPUs each at will | |
| and maintain the resources for precisely the amount of time needed. | |
| ## Set Up an AWS GPU Cluster from Scratch | |
| In this document, we provide a step-by-step guide that will teach you | |
| how to set up an AWS cluster with _MXNet_. We show how to: | |
| - [Use Amazon S3 to host data](#use-amazon-s3-to-host-data) | |
- [Set up an EC2 GPU instance with all dependencies installed](#set-up-an-ec2-gpu-instance-from-scratch)
| - [Build and run MXNet on a single computer](#build-and-run-mxnet-on-a-gpu-instance) | |
| - [Set up an EC2 GPU cluster for distributed training](#set-up-an-ec2-gpu-cluster-for-distributed-training) | |
| ### Use Amazon S3 to Host Data | |
Amazon S3 provides distributed data storage, which is especially convenient for hosting large datasets.
| To use S3, you need [AWS credentials](http://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSGettingStartedGuide/AWSCredentials.html), | |
| including an `ACCESS_KEY_ID` and a `SECRET_ACCESS_KEY`. | |
| To use _MXNet_ with S3, set the environment variables `AWS_ACCESS_KEY_ID` and | |
| `AWS_SECRET_ACCESS_KEY` by adding the following two lines in | |
| `~/.bashrc` (replacing the strings with the correct ones): | |
| ```bash | |
| export AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE | |
| export AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY | |
| ``` | |
| There are several ways to upload data to S3. One simple way is to use | |
| [s3cmd](http://s3tools.org/s3cmd). For example: | |
| ```bash | |
| wget http://data.mxnet.io/mxnet/data/mnist.zip | |
| unzip mnist.zip && s3cmd put t*-ubyte s3://dmlc/mnist/ | |
| ``` | |
### Use a Pre-installed EC2 GPU Instance
The [Deep Learning AMI](https://aws.amazon.com/marketplace/pp/B01M0AXXQB?qid=1475211685369&sr=0-1&ref_=srh_res_product_title) is an Amazon Linux image
supported and maintained by Amazon Web Services for use on Amazon Elastic Compute Cloud (Amazon EC2).
It contains the [MXNet v0.9.3 tag](https://github.com/dmlc/mxnet) and the necessary components to get going with deep learning,
including Nvidia drivers, CUDA, cuDNN, Anaconda, Python 2, and Python 3.
| The AMI IDs are the following: | |
| * us-east-1: ami-e7c96af1 | |
| * us-west-2: ami-dfb13ebf | |
| * eu-west-1: ami-6e5d6808 | |
| Now you can launch _MXNet_ directly on an EC2 GPU instance. | |
You can also use a [Jupyter](http://jupyter.org) notebook on the EC2 machine.
| Here is a [good tutorial](https://github.com/dmlc/mxnet-notebooks) | |
| on how to connect to a Jupyter notebook running on an EC2 instance. | |
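One common pattern (the key file and host name below are placeholders for your own) is to start the notebook on the instance and forward its port over SSH from your local machine:
```bash
# On the EC2 instance: start Jupyter without opening a browser
jupyter notebook --no-browser --port=8888

# On your local machine: forward the port, then open http://localhost:8888
ssh -i my-key-pair.pem -N -L 8888:localhost:8888 ubuntu@ec2-xx-xx-xx-xx.compute-1.amazonaws.com
```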
| ### Set Up an EC2 GPU Instance from Scratch | |
| _MXNet_ requires the following libraries: | |
| - C++ compiler with C++11 support, such as `gcc >= 4.8` | |
- `CUDA` (`CUDNN` is optional) for GPU linear algebra
- `BLAS` (cblas, openblas, atlas, mkl, or others) for CPU linear algebra
| - `opencv` for image augmentations | |
| - `curl` and `openssl` for the ability to read/write to Amazon S3 | |
| Installing `CUDA` on EC2 instances requires some effort. Caffe has a good | |
| [tutorial](https://github.com/BVLC/caffe/wiki/Install-Caffe-on-EC2-from-scratch-(Ubuntu,-CUDA-7,-cuDNN-3)) | |
| on how to install CUDA 7.0 on Ubuntu 14.04. | |
| ***Note:*** We tried CUDA 7.5 on Nov 7, 2015, but found it problematic. | |
| You can install the rest using the package manager. For example, on Ubuntu: | |
| ``` | |
| sudo apt-get update | |
| sudo apt-get install -y build-essential git libcurl4-openssl-dev libatlas-base-dev libopencv-dev python-numpy | |
| ``` | |
| The Amazon Machine Image (AMI) [ami-12fd8178](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-12fd8178) has the packages listed above installed. | |
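If you prefer the command line to the EC2 console, a minimal sketch of launching a GPU instance from this AMI with the AWS CLI looks like the following; the key pair, security group, and instance type are placeholders for your own setup:
```bash
aws ec2 run-instances \
    --image-id ami-12fd8178 \
    --instance-type g2.2xlarge \
    --key-name my-key-pair \
    --security-group-ids sg-0123456789abcdef0 \
    --count 1
```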
| ### Build and Run MXNet on a GPU Instance | |
| The following commands build _MXNet_ with CUDA/CUDNN, Amazon S3, and distributed | |
| training. | |
| ```bash | |
| git clone --recursive https://github.com/dmlc/mxnet | |
| cd mxnet; cp make/config.mk . | |
| echo "USE_CUDA=1" >>config.mk | |
| echo "USE_CUDA_PATH=/usr/local/cuda" >>config.mk | |
| echo "USE_CUDNN=1" >>config.mk | |
| echo "USE_BLAS=atlas" >> config.mk | |
| echo "USE_DIST_KVSTORE = 1" >>config.mk | |
| echo "USE_S3=1" >>config.mk | |
| make -j$(nproc) | |
| ``` | |
| To test whether everything is installed properly, we can try training a convolutional neural network (CNN) on the MNIST dataset using a GPU: | |
| ```bash | |
| python example/image-classification/train_mnist.py | |
| ``` | |
If you've placed the MNIST data at `s3://dmlc/mnist`, you can modify the script to read the data directly from Amazon S3:
| ```bash | |
| sed -i.bak "s!data_dir = 'data'!data_dir = 's3://dmlc/mnist'!" example/image-classification/train_mnist.py | |
| ``` | |
| ***Note:*** You can use `sudo ln /dev/null /dev/raw1394` to fix the opencv error `libdc1394 error: Failed to initialize libdc1394`. | |
| ### Set Up an EC2 GPU Cluster for Distributed Training | |
| A cluster consists of multiple computers. | |
You can use one computer with _MXNet_ installed as the root computer for submitting jobs, and then launch several
slave computers to run the jobs. For example, launch multiple instances using an
AMI with the dependencies installed, e.g.,
[ami-12fd8178](https://console.aws.amazon.com/ec2/v2/home?region=us-east-1#LaunchInstanceWizard:ami=ami-12fd8178).
There are two options:
| - Make all slaves' ports accessible (same for the root) by setting type: All TCP, | |
| Source: Anywhere in Configure Security Group. | |
- Use the same `pem` key as the root computer to access all slave computers, and
  then copy the `pem` file into the root computer's `~/.ssh/id_rsa`. If you do this, all slave computers can be accessed with SSH from the root, as sketched below.
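For example, with a placeholder key name and one of the slave IPs used later in this guide, the key setup and a quick connectivity check might look like:
```bash
# Hypothetical key name; use the .pem the slave instances were launched with
cp my-key-pair.pem ~/.ssh/id_rsa
chmod 600 ~/.ssh/id_rsa

# Should log in without a password prompt
ssh -o StrictHostKeyChecking=no 172.30.0.172 uname -a
```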
Now, run the CNN on multiple computers. Assume that we are in a working
directory on the root computer, such as `~/train`, and that MXNet is built under `~/mxnet`.
| 1. Pack the _MXNet_ Python library into this working directory for easy | |
| synchronization: | |
| ```bash | |
| cp -r ~/mxnet/python/mxnet . | |
| cp ~/mxnet/lib/libmxnet.so mxnet/ | |
| ``` | |
| And then copy the training program: | |
| ```bash | |
| cp ~/mxnet/example/image-classification/*.py . | |
| cp -r ~/mxnet/example/image-classification/common . | |
| ``` | |
2. Prepare a host file with all slaves' private IPs. For example, `cat hosts`:
| ```bash | |
| 172.30.0.172 | |
| 172.30.0.171 | |
| ``` | |
| 3. Assuming that there are two computers, train the CNN using two workers: | |
| ```bash | |
~/mxnet/tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet python train_mnist.py --kv-store dist_sync
| ``` | |
| ***Note:*** Sometimes the jobs linger at the slave computers even though you've pressed `Ctrl-c` | |
| at the root node. To terminate them, use the following command: | |
| ```bash | |
| cat hosts | xargs -I{} ssh -o StrictHostKeyChecking=no {} 'uname -a; pgrep python | xargs kill -9' | |
| ``` | |
***Note:*** The preceding example trains a very simple model and therefore isn't a good
benchmark for distributed training. Consider using the other [examples](https://github.com/dmlc/mxnet/tree/master/example/image-classification).
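For instance, assuming the CIFAR-10 script copied from the same example directory and the same two-node hosts file, a heavier run might look like this (the flags are assumptions; check the script's `--help`):
```bash
~/mxnet/tools/launch.py -n 2 -H hosts --sync-dir /tmp/mxnet \
    python train_cifar10.py --kv-store dist_sync --gpus 0
```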
| ### More Options | |
| #### Use Multiple Data Shards | |
| It is common to pack a dataset into multiple files, especially when working in a distributed environment. | |
| _MXNet_ supports direct loading from multiple data shards. | |
| Put all of the record files into a folder, and point the data path to the folder. | |
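A minimal sketch, assuming a hypothetical bucket and set of RecordIO shards: upload the shards into a single folder, then pass the folder rather than an individual file as the data path to your training script or iterator:
```bash
# Hypothetical shard names and bucket path
s3cmd put part-0000.rec part-0001.rec part-0002.rec s3://dmlc/imagenet-shards/

# Then point the script's data path (e.g. a --data-dir style option or the
# iterator's record path) at s3://dmlc/imagenet-shards/ instead of one .rec file.
```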
| #### Use YARN and SGE | |
Although SSH is simple to use when you don't have a cluster scheduling framework,
_MXNet_ is designed to be portable to various platforms.
| We provide scripts available in [tracker](https://github.com/dmlc/dmlc-core/tree/master/tracker) | |
| to allow running on other cluster frameworks, including Hadoop (YARN) and SGE. | |
We welcome community contributions of examples that run _MXNet_ on your favorite distributed platform.