| # Use data from S3 for training |
| |
| AWS S3 is a cloud-based object storage service that allows storage and retrieval of large amounts of data at a very low cost. This makes it an attractive option to store large training datasets. MXNet is deeply integrated with S3 for this purpose. |
| |
| An S3 protocol URL (like `s3://bucket-name/training-data`) can be provided as a parameter for any data iterator that takes a file path as input. For example, |
| |
| ``` |
| data_iter = mx.io.ImageRecordIter( |
| path_imgrec="s3://bucket-name/training-data/caltech_train.rec", |
| data_shape=(3, 227, 227), |
| batch_size=4, |
| resize=256) |
| ``` |
| Following are detailed instructions on how to use data from S3 for training. |
| |
| ## Step 1: Build MXNet with S3 integration enabled |
| |
| Follow instructions [here](http://mxnet.io/get_started/install.html) to install MXNet from source with the following additional steps to enable S3 integration. |
| |
| 1. Install `libcurl4-openssl-dev` and `libssl-dev` before building MXNet. These packages are required to read/write from AWS S3. |
| 2. Append `USE_S3=1` to `config.mk` before building MXNet. |
| ``` |
| echo "USE_S3=1" >> config.mk |
| ``` |
| |
| ## Step 2: Configure S3 authentication tokens |
| |
| MXNet requires the S3 environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to be set. [Here](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/) are instructions to get the access keys from AWS console. |
| |
| ``` |
| export AWS_ACCESS_KEY_ID=<your-access-key-id> |
| AWS_SECRET_ACCESS_KEY=<your-secret-access-key> |
| ``` |
| |
| ## Step 3: Upload data to S3 |
| |
| There are several ways to upload data to S3. One easy way is to use the AWS command line utility. For example, the following `sync` command will recursively copy contents from a local directory to a directory in S3. |
| |
| ``` |
| aws s3 sync ./training-data s3://bucket-name/training-data |
| ``` |
| |
| ## Step 4: Train with data from S3 |
| |
| Once the data is in S3, it is very straightforward to use it from MXNet. Any data iterator that can read/write data from a local drive can also read/write data from S3. |
| |
| Let's modify an existing example code in MXNet repository to read data from S3 instead of local disk. [`mxnet/tests/python/train/test_conv.py`](https://github.com/dmlc/mxnet/blob/master/tests/python/train/test_conv.py) trains a convolutional network using MNIST data from local disk. We'll do the following change to read the data from S3 instead. |
| |
| ``` |
| ~/mxnet$ sed -i -- 's/data\//s3:\/\/bucket-name\/training-data\//g' ./tests/python/train/test_conv.py |
| |
| ~/mxnet$ git diff ./tests/python/train/test_conv.py |
| diff --git a/tests/python/train/test_conv.py b/tests/python/train/test_conv.py |
| index 039790e..66a60ce 100644 |
| --- a/tests/python/train/test_conv.py |
| +++ b/tests/python/train/test_conv.py |
| @@ -39,14 +39,14 @@ def get_iters(): |
| |
| batch_size = 100 |
| train_dataiter = mx.io.MNISTIter( |
| - image="data/train-images-idx3-ubyte", |
| - label="data/train-labels-idx1-ubyte", |
| + image="s3://bucket-name/training-data/train-images-idx3-ubyte", |
| + label="s3://bucket-name/training-data/train-labels-idx1-ubyte", |
| data_shape=(1, 28, 28), |
| label_name='sm_label', |
| batch_size=batch_size, shuffle=True, flat=False, silent=False, seed=10) |
| val_dataiter = mx.io.MNISTIter( |
| - image="data/t10k-images-idx3-ubyte", |
| - label="data/t10k-labels-idx1-ubyte", |
| + image="s3://bucket-name/training-data/t10k-images-idx3-ubyte", |
| + label="s3://bucket-name/training-data/t10k-labels-idx1-ubyte", |
| data_shape=(1, 28, 28), |
| label_name='sm_label', |
| batch_size=batch_size, shuffle=True, flat=False, silent=False) |
| ``` |
| |
| After the above change `test_conv.py` will fetch data from S3 instead of the local disk. |
| |
| ``` |
| python ./tests/python/train/test_conv.py |
| [21:59:19] src/io/s3_filesys.cc:878: No AWS Region set, using default region us-east-1 |
| [21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28) |
| [21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28) |
| INFO:root:Start training with [cpu(0)] |
| Start training with [cpu(0)] |
| INFO:root:Epoch[0] Resetting Data Iterator |
| Epoch[0] Resetting Data Iterator |
| INFO:root:Epoch[0] Time cost=11.277 |
| Epoch[0] Time cost=11.277 |
| INFO:root:Epoch[0] Validation-accuracy=0.955100 |
| Epoch[0] Validation-accuracy=0.955100 |
| INFO:root:Finish fit... |
| Finish fit... |
| INFO:root:Finish predict... |
| Finish predict... |
| INFO:root:final accuracy = 0.955100 |
| final accuracy = 0.955100 |
| ``` |