 # Use data from S3 for training

 AWS S3 is a cloud-based object storage service that stores and retrieves large amounts of data at low cost, which makes it an attractive place to keep large training datasets. When built with S3 support, MXNet can read training data directly from S3.

 An S3 protocol URL (like `s3://bucket-name/training-data`) can be passed to any data iterator that takes a file path as input. For example:

 ```
 import mxnet as mx

 # Read a RecordIO file directly from S3 instead of from local disk.
 data_iter = mx.io.ImageRecordIter(
     path_imgrec="s3://bucket-name/training-data/caltech_train.rec",
     data_shape=(3, 227, 227),
     batch_size=4,
     resize=256)
 ```

 The following steps describe in detail how to use data from S3 for training.

 ## Step 1: Build MXNet with S3 integration enabled

 Follow the [installation instructions](http://mxnet.io/get_started/install.html) to build MXNet from source, with the following additional steps to enable S3 integration.

 1. Install `libcurl4-openssl-dev` and `libssl-dev` before building MXNet. These packages are required to read from and write to AWS S3.
 2. Append `USE_S3=1` to `config.mk` before building MXNet.
     ```
     echo "USE_S3=1" >> config.mk
     ```

 ## Step 2: Configure S3 authentication tokens

 MXNet requires the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to be set. See [these instructions](https://aws.amazon.com/blogs/security/wheres-my-secret-access-key/) for getting the access keys from the AWS console.

 ```
 export AWS_ACCESS_KEY_ID=<your-access-key-id>
 export AWS_SECRET_ACCESS_KEY=<your-secret-access-key>
 ```
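
Before creating any iterators, it can help to confirm that both variables are visible to the Python process. The following is a minimal sketch of such a check; the helper name `missing_aws_credentials` is an assumption made for illustration and is not part of MXNet.

```python
import os

# The two variables named in this step.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the names of required AWS variables that are unset or empty.

    Hypothetical helper: `env` defaults to the process environment, and a
    plain dict can be passed instead for testing.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_aws_credentials()` with no arguments inspects `os.environ`; an empty list means both variables are set.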

 ## Step 3: Upload data to S3

 There are several ways to upload data to S3. One easy way is the AWS command line interface (AWS CLI). For example, the following `sync` command recursively copies the contents of a local directory to a directory in S3.

 ```
 aws s3 sync ./training-data s3://bucket-name/training-data
 ```
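
After the upload, each file is addressable by an `s3://` URL that mirrors its path relative to the local directory. The sketch below computes that mapping locally, without contacting AWS, which can be handy for filling in iterator paths. The function name `s3_urls_for_directory` is hypothetical.

```python
import os


def s3_urls_for_directory(local_dir, s3_prefix):
    """Map each file under local_dir to the URL it would have under s3_prefix.

    Hypothetical helper: it only builds strings and never talks to S3.
    """
    s3_prefix = s3_prefix.rstrip("/")
    urls = {}
    for root, _dirs, files in os.walk(local_dir):
        for name in files:
            local_path = os.path.join(root, name)
            # The object key uses forward slashes regardless of the local OS.
            rel = os.path.relpath(local_path, local_dir).replace(os.sep, "/")
            urls[local_path] = s3_prefix + "/" + rel
    return urls
```

For example, `s3_urls_for_directory("./training-data", "s3://bucket-name/training-data")` returns a dictionary keyed by local file path.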

 ## Step 4: Train with data from S3

 Once the data is in S3, using it from MXNet is straightforward: any data iterator that can read or write data on a local drive can also do so on S3.

 Let's modify an existing example in the MXNet repository to read data from S3 instead of local disk. [`mxnet/tests/python/train/test_conv.py`](https://github.com/dmlc/mxnet/blob/master/tests/python/train/test_conv.py) trains a convolutional network using MNIST data from local disk. The following change makes it read the data from S3 instead.

 ```
 ~/mxnet$ sed -i -- 's/data\//s3:\/\/bucket-name\/training-data\//g' ./tests/python/train/test_conv.py

 ~/mxnet$ git diff ./tests/python/train/test_conv.py
 diff --git a/tests/python/train/test_conv.py b/tests/python/train/test_conv.py
 index 039790e..66a60ce 100644
 --- a/tests/python/train/test_conv.py
 +++ b/tests/python/train/test_conv.py
 @@ -39,14 +39,14 @@ def get_iters():

      batch_size = 100
      train_dataiter = mx.io.MNISTIter(
 -            image="data/train-images-idx3-ubyte",
 -            label="data/train-labels-idx1-ubyte",
 +            image="s3://bucket-name/training-data/train-images-idx3-ubyte",
 +            label="s3://bucket-name/training-data/train-labels-idx1-ubyte",
              data_shape=(1, 28, 28),
              label_name='sm_label',
              batch_size=batch_size, shuffle=True, flat=False, silent=False, seed=10)
      val_dataiter = mx.io.MNISTIter(
 -            image="data/t10k-images-idx3-ubyte",
 -            label="data/t10k-labels-idx1-ubyte",
 +            image="s3://bucket-name/training-data/t10k-images-idx3-ubyte",
 +            label="s3://bucket-name/training-data/t10k-labels-idx1-ubyte",
              data_shape=(1, 28, 28),
              label_name='sm_label',
              batch_size=batch_size, shuffle=True, flat=False, silent=False)
 ```

 After the above change, `test_conv.py` fetches data from S3 instead of the local disk.

 ```
 ~/mxnet$ python ./tests/python/train/test_conv.py
 [21:59:19] src/io/s3_filesys.cc:878: No AWS Region set, using default region us-east-1
 [21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 60000 images, shuffle=1, shape=(100,1,28,28)
 [21:59:21] src/io/iter_mnist.cc:94: MNISTIter: load 10000 images, shuffle=1, shape=(100,1,28,28)
 INFO:root:Start training with [cpu(0)]
 Start training with [cpu(0)]
 INFO:root:Epoch[0] Resetting Data Iterator
 Epoch[0] Resetting Data Iterator
 INFO:root:Epoch[0] Time cost=11.277
 Epoch[0] Time cost=11.277
 INFO:root:Epoch[0] Validation-accuracy=0.955100
 Epoch[0] Validation-accuracy=0.955100
 INFO:root:Finish fit...
 Finish fit...
 INFO:root:Finish predict...
 Finish predict...
 INFO:root:final accuracy = 0.955100
 final accuracy = 0.955100
 ```
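
As an alternative to rewriting paths with `sed`, a script can take the data location from configuration, so the same code runs against local disk or S3. This is a minimal sketch under stated assumptions: the `data_path` helper and the `MXNET_DATA_ROOT` environment variable are invented for this example and are not defined by MXNet.

```python
import os


def data_path(filename, root=None):
    """Join filename onto a data root, which may be a local dir or an s3:// URL.

    Hypothetical helper: when `root` is not given, it is read from the
    MXNET_DATA_ROOT environment variable (an assumed name), falling back
    to the local "data" directory used by the original script.
    """
    if root is None:
        root = os.environ.get("MXNET_DATA_ROOT", "data")
    return root.rstrip("/") + "/" + filename
```

The result would then be passed to an iterator, for example `image=data_path("train-images-idx3-ubyte")` in the `mx.io.MNISTIter` call, with `MXNET_DATA_ROOT` set to `s3://bucket-name/training-data` to read from S3.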