<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
Digit Recognition on MNIST
==========================
In this tutorial, we will work through examples of training a simple
multi-layer perceptron and then a convolutional neural network (the
LeNet architecture) on the [MNIST handwritten digit
dataset](http://yann.lecun.com/exdb/mnist/). The code for this tutorial
can be found in
[examples/mnist](https://github.com/dmlc/MXNet.jl/tree/master/examples/mnist). There are also two Jupyter notebooks that expand a little more on the [MLP](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistMLP.ipynb) and the [LeNet](https://github.com/ultradian/julia_notebooks/blob/master/mxnet/mnistLenet.ipynb), using the more general `ArrayDataProvider`.
Simple 3-layer MLP
------------------
This is a tiny 3-layer MLP that can easily be trained on a CPU. The
script starts with
```julia
using MXNet
```
to load the `MXNet` module. Then we are ready to define the network
architecture via the [symbolic API](../user-guide/overview.md). We start
with a placeholder `data` symbol,
```julia
data = mx.Variable(:data)
```
and then cascading fully-connected layers and activation functions:
```julia
fc1 = mx.FullyConnected(data, name=:fc1, num_hidden=128)
act1 = mx.Activation(fc1, name=:relu1, act_type=:relu)
fc2 = mx.FullyConnected(act1, name=:fc2, num_hidden=64)
act2 = mx.Activation(fc2, name=:relu2, act_type=:relu)
fc3 = mx.FullyConnected(act2, name=:fc3, num_hidden=10)
```
Note that in each composition we take the previous symbol as the first argument,
forming a feedforward chain. The architecture looks like
```
Input --> 128 units (ReLU) --> 64 units (ReLU) --> 10 units
```
where the last 10 units correspond to the 10 output classes (digits
0,...,9). We then add a final `SoftmaxOutput` operation to turn the
10-dimensional prediction into proper probability values for the 10
classes:
```julia
mlp = mx.SoftmaxOutput(fc3, name=:softmax)
```
As we can see, the MLP is just a chain of layers. For this case, we can
also use the `mx.chain` macro. The same architecture above can be
defined as
```julia
mlp = @mx.chain mx.Variable(:data) =>
        mx.FullyConnected(name=:fc1, num_hidden=128) =>
        mx.Activation(name=:relu1, act_type=:relu) =>
        mx.FullyConnected(name=:fc2, num_hidden=64) =>
        mx.Activation(name=:relu2, act_type=:relu) =>
        mx.FullyConnected(name=:fc3, num_hidden=10) =>
        mx.SoftmaxOutput(name=:softmax)
```
After defining the architecture, we are ready to load the MNIST data.
MXNet.jl provides built-in data providers for the MNIST dataset, which
will automatically download the dataset into
`Pkg.dir("MXNet")/data/mnist` if necessary. We wrap the code that
constructs the data providers in `mnist-data.jl` so that it can be
shared by both the MLP example and the LeNet ConvNet example.
```julia
batch_size = 100
include("mnist-data.jl")
train_provider, eval_provider = get_mnist_providers(batch_size)
```
If you need to write your own data provider for a customized data format,
please refer to [`mx.AbstractDataProvider`](@ref).
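For data that already lives in memory as plain Julia arrays, you can often avoid writing a provider at all and use the more general `mx.ArrayDataProvider` mentioned in the notebooks above. Below is a minimal sketch; the random arrays are just stand-ins for real data, and the exact keyword arguments should be checked against the API reference.
```julia
using MXNet

# stand-in data: 784-dimensional features (flattened 28x28 images) and labels 0..9
x = rand(Float32, 784, 1000)
y = rand(0:9, 1000)
# the label name must match the SoftmaxOutput symbol name (:softmax => :softmax_label)
custom_provider = mx.ArrayDataProvider(:data => x, :softmax_label => y;
                                       batch_size=100, shuffle=true)
```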
Given the architecture and data, we can instantiate a *model* to do the
actual training. `mx.FeedForward` is the built-in model that is suitable
for most feed-forward architectures. When constructing the model, we
also specify the *context* on which the computation should be carried
out. Because this is a really tiny MLP, we will just run on a single CPU
device.
```julia
model = mx.FeedForward(mlp, context=mx.cpu())
```
You can use `mx.gpu()` instead, or provide a list of devices (e.g.
`[mx.gpu(0), mx.gpu(1)]`), in which case data parallelization will be used
automatically. But for this tiny example, using a GPU device might not
help.
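For example, a data-parallel model over two GPUs could be constructed like this (a sketch that assumes two GPU devices are actually available):
```julia
# split each mini-batch across two GPUs
model = mx.FeedForward(mlp, context=[mx.gpu(0), mx.gpu(1)])
```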
The last thing we need to specify is the optimization algorithm (a.k.a.
the *optimizer*) to use. We use basic SGD with a fixed learning rate of 0.1,
momentum of 0.9 and weight decay of 0.00001:
```julia
optimizer = mx.SGD(η=0.1, μ=0.9, λ=0.00001)
```
Now we can do the training. Here the `n_epoch` parameter specifies that
we want to train for 20 epochs. We also supply an `eval_data` provider to
monitor the accuracy on the validation set.
```julia
mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
```
Here is a sample output:
```
INFO: Start training on [CPU0]
INFO: Initializing parameters...
INFO: Creating KVStore...
INFO: == Epoch 001 ==========
INFO: ## Training summary
INFO: :accuracy = 0.7554
INFO: time = 1.3165 seconds
INFO: ## Validation summary
INFO: :accuracy = 0.9502
...
INFO: == Epoch 020 ==========
INFO: ## Training summary
INFO: :accuracy = 0.9949
INFO: time = 0.9287 seconds
INFO: ## Validation summary
INFO: :accuracy = 0.9775
```
Convolutional Neural Networks
-----------------------------
In the second example, we show a slightly more complicated architecture
that involves convolution and pooling. This architecture for MNIST
is usually called LeNet. The first part of the architecture is
listed below:
```julia
# input
data = mx.Variable(:data)
# first conv
conv1 = @mx.chain mx.Convolution(data, kernel=(5,5), num_filter=20) =>
                  mx.Activation(act_type=:tanh) =>
                  mx.Pooling(pool_type=:max, kernel=(2,2), stride=(2,2))
# second conv
conv2 = @mx.chain mx.Convolution(conv1, kernel=(5,5), num_filter=50) =>
                  mx.Activation(act_type=:tanh) =>
                  mx.Pooling(pool_type=:max, kernel=(2,2), stride=(2,2))
```
We have basically defined two convolution modules. Each convolution module
is a chain of `Convolution`, `tanh` activation and max `Pooling` operations.
Each sample in the MNIST dataset is a 28x28 single-channel grayscale
image. In the tensor format used by `NDArray`, a batch of 100 samples is
a tensor of shape `(28,28,1,100)`. Convolution and pooling operate
on the spatial axes, so `kernel=(5,5)` indicates a square region of
width 5 and height 5. The rest of the architecture follows:
```julia
# first fully-connected
fc1 = @mx.chain mx.Flatten(conv2) =>
                mx.FullyConnected(num_hidden=500) =>
                mx.Activation(act_type=:tanh)
# second fully-connected
fc2 = mx.FullyConnected(fc1, num_hidden=10)
# softmax loss
lenet = mx.SoftmaxOutput(fc2, name=:softmax)
```
Note that a fully-connected operator expects its input to be a matrix,
while the results of spatial convolution and pooling are 4D
tensors. So we explicitly use a `Flatten` operator to flatten the tensor
before connecting it to the `FullyConnected` operator.
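If you want to verify these shapes, you can ask MXNet to infer the output shape of an intermediate symbol for a given input shape. A minimal sketch using `mx.infer_shape` is below; the printed shapes are what we expect from the architecture above, not verified output.
```julia
# infer the shape flowing out of the second conv module for a batch of
# 100 images of size 28x28 with 1 channel (batch dimension last)
_, conv_shapes, _ = mx.infer_shape(conv2, data=(28, 28, 1, 100))
println(conv_shapes)   # expect something like [(4, 4, 50, 100)]
# after Flatten, the fully-connected layers therefore see an (800, 100) matrix
```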
The rest of the network is the same as the previous MLP example. As
before, we can now load the MNIST dataset:
```julia
batch_size = 100
include("mnist-data.jl")
train_provider, eval_provider = get_mnist_providers(batch_size; flat=false)
```
Note that we specified `flat=false` to tell the data provider to produce 4D
tensors instead of 2D matrices, because the convolution operators need
the correct spatial shape information. We then construct a feedforward model
on the GPU and train it.
```julia
# fit model
model = mx.FeedForward(lenet, context=mx.gpu())
# optimizer
optimizer = mx.SGD(η=0.05, μ=0.9, λ=0.00001)
# fit parameters
mx.fit(model, optimizer, train_provider, n_epoch=20, eval_data=eval_provider)
```
Here is a sample of the output:
```
INFO: == Epoch 001 ==========
INFO: ## Training summary
INFO: :accuracy = 0.6750
INFO: time = 4.9814 seconds
INFO: ## Validation summary
INFO: :accuracy = 0.9712
...
INFO: == Epoch 020 ==========
INFO: ## Training summary
INFO: :accuracy = 1.0000
INFO: time = 4.0086 seconds
INFO: ## Validation summary
INFO: :accuracy = 0.9915
```
Predicting with a trained model
-------------------------------
Predicting with a trained model is very simple. By calling `mx.predict`
with the model and a data provider, we get the model output as a Julia
Array:
```julia
probs = mx.predict(model, eval_provider)
```
The following code shows a naive way of collecting all the labels from the
data provider and computing the prediction accuracy manually:
```julia
using Printf  # @printf lives in the Printf standard library on Julia >= 0.7

# collect all labels from the eval data
labels = reduce(
  vcat,
  copy(mx.get(eval_provider, batch, :softmax_label)) for batch ∈ eval_provider)
# labels are 0...9; shift to 1...10 to match Julia's 1-based indexing
labels .= labels .+ 1
# now compute the accuracy
pred = map(i -> argmax(probs[1:10, i]), 1:size(probs, 2))
correct = sum(pred .== labels)
@printf "Accuracy on eval set: %.2f%%\n" 100correct/length(labels)
```
Alternatively, when the dataset is huge, one can provide a callback to
`mx.predict`; the callback function will then be invoked with the
outputs of each mini-batch. The callback could, for example, write the
data to disk for future inspection. In this case, no value is returned
from `mx.predict`. See also [`mx.predict`](@ref).
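Below is a minimal sketch of this callback form, written as a do-block. It assumes `mx.predict` calls the supplied function once per mini-batch with that mini-batch's output; check the API reference for the exact argument type.
```julia
mx.predict(model, eval_provider) do batch_output
  # batch_output is the model output for one mini-batch; post-process or
  # persist it here instead of accumulating everything in memory
  println(summary(batch_output))
end
```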