MXNet R Tutorial on NDArray and Symbol
======================================
This vignette gives a general overview of MXNet's R package. MXNet provides a
mix of flexible and efficient building blocks for constructing
applications. This tutorial introduces two major concepts:
* [NDArray](#ndarray-vectorized-tensor-computations-on-cpus-and-gpus)
offers matrix and tensor computations on both CPU and GPU, with automatic
parallelization
* [Symbol](#symbol-and-automatic-differentiation) makes defining a neural
network extremely easy, and provides automatic differentiation.
## NDArray: Vectorized tensor computations on CPUs and GPUs
`NDArray` is the basic vectorized operation unit in MXNet for matrix and tensor computations.
Users can perform the usual calculations as on an R array, but with two additional features:
1. **multiple devices**: all operations can be run on various devices including
CPU and GPU
2. **automatic parallelization**: all operations are automatically executed in
parallel with each other
### Creation and Initialization
Let"s create `NDArray` on either GPU or CPU
```{r}
require(mxnet)
a <- mx.nd.zeros(c(2, 3)) # create a 2-by-3 matrix on cpu
b <- mx.nd.zeros(c(2, 3), mx.cpu()) # create a 2-by-3 matrix on cpu
# c <- mx.nd.zeros(c(2, 3), mx.gpu(0)) # create a 2-by-3 matrix on gpu 0, if you have CUDA enabled.
```
As a side note, for CUDA-enabled devices the GPU device id normally starts from 0,
which is why we passed 0 as the GPU id above. We can also initialize an `NDArray` object in various ways:
```{r}
a <- mx.nd.ones(c(4, 4))
b <- mx.rnorm(c(4, 5))
c <- mx.nd.array(1:5)
```
To check the numbers in an `NDArray`, we can simply run
```{r}
a <- mx.nd.ones(c(2, 3))
b <- as.array(a)
class(b)
b
```
### Basic Operations
#### Element-wise operations
You can perform element-wise operations on `NDArray` objects:
```{r}
a <- mx.nd.ones(c(2, 4)) * 2
b <- mx.nd.ones(c(2, 4)) / 8
as.array(a)
as.array(b)
c <- a + b
as.array(c)
d <- c / a - 5
as.array(d)
```
If two `NDArray`s sit on different devices, we need to explicitly move them
onto the same device. For instance:
```{r, eval=FALSE}
a <- mx.nd.ones(c(2, 3)) * 2
b <- mx.nd.ones(c(2, 3), mx.gpu()) / 8
c <- mx.nd.copyto(a, mx.gpu()) * b
as.array(c)
```
#### Load and Save
You can save an `NDArray` object (or a list of them) to disk with `mx.nd.save`:
```{r}
a <- mx.nd.ones(c(2, 3))
mx.nd.save(a, "temp.ndarray")
```
You can also load it back easily:
```{r}
a <- mx.nd.load("temp.ndarray")
as.array(a[[1]])
```
If you want to store data on a distributed file system such as S3 or HDFS,
you can save to and load from it directly. For example:
```{r,eval=FALSE}
mx.nd.save(list(a), "s3://mybucket/mydata.bin")
mx.nd.save(list(a), "hdfs///users/myname/mydata.bin")
```
### Automatic Parallelization
`NDArray` can automatically execute operations in parallel, which is desirable when we
use multiple resources such as CPU cores, GPU cards, and CPU-to-GPU memory bandwidth.
For example, if we write `a <- a + 1` followed by `b <- b + 1`, and `a` is on the CPU while
`b` is on the GPU, then we want to execute them in parallel to improve
efficiency. Furthermore, data copies between CPU and GPU are also expensive, so we
would like to run them in parallel with other computations as well.
However, finding by eye which pieces of code can be executed in parallel is hard. In the
following example, `a <- a + 1` and `c <- c * 3` can be executed in parallel, but `a <- a + 1` and
`b <- b * 3` must be executed sequentially.
```{r}
a <- mx.nd.ones(c(2,3))
b <- a
c <- mx.nd.copyto(a, mx.cpu())
a <- a + 1
b <- b * 3
c <- c * 3
```
Luckily, MXNet can automatically resolve the dependencies and
execute operations in parallel with correctness guaranteed. In other words, we
can write the program as if there were only a single thread, and MXNet will
automatically dispatch it to multiple devices, such as multiple GPU cards or multiple
machines.
This is achieved by lazy evaluation. Every operation we write down is issued to an
internal engine and then returns. For example, if we run `a <- a + 1`, it
returns immediately after the plus operator is pushed to the engine. This
asynchrony allows us to push more operators to the engine, so it can determine
the read and write dependencies and find the best way to execute them in
parallel.
The actual computations are finished only when we copy the results to some
other place, such as with `as.array(a)` or `mx.nd.save(a, "temp.dat")`. Therefore, to
write highly parallelized code, we only need to postpone asking for
the results.
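As a small illustration (a sketch only; the shapes and variable names below are arbitrary), all of the operations in the following chunk return immediately, and the pending work is forced only when we copy a result back into an R array:
```{r}
# Each NDArray operation below returns as soon as it is pushed to the engine.
x <- mx.nd.ones(c(512, 512)) * 2   # pushed to the engine, returns at once
y <- mx.nd.ones(c(512, 512)) * 3   # independent of x, so it may run in parallel
z <- x + y                         # depends on both; the engine orders it correctly
# Copying back to an R array is the synchronization point that waits for
# all pending work on z to finish.
as.array(z)[1, 1:3]
```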
## Symbol and Automatic Differentiation
With the computational unit `NDArray`, we need a way to construct neural networks. MXNet provides a symbolic interface, named Symbol, to do so. Symbol combines both flexibility and efficiency.
### Basic Composition of Symbols
The following code creates a two-layer perceptron network:
```{r}
require(mxnet)
net <- mx.symbol.Variable("data")
net <- mx.symbol.FullyConnected(data=net, name="fc1", num_hidden=128)
net <- mx.symbol.Activation(data=net, name="relu1", act_type="relu")
net <- mx.symbol.FullyConnected(data=net, name="fc2", num_hidden=64)
net <- mx.symbol.SoftmaxOutput(data=net, name="out")
class(net)
```
Each symbol takes a (unique) string name. *Variable* often defines the inputs,
or free variables. Other symbols take a symbol as the input (*data*),
and may accept other hyper-parameters such as the number of hidden neurons (*num_hidden*)
or the activation type (*act_type*).
The symbol can be simply viewed as a function taking several arguments, whose
names are automatically generated and can be retrieved with:
```{r}
arguments(net)
```
As can be seen, these arguments are the parameters needed by each symbol:
- *data* : input data needed by the variable *data*
- *fc1_weight* and *fc1_bias* : the weight and bias for the first fully connected layer *fc1*
- *fc2_weight* and *fc2_bias* : the weight and bias for the second fully connected layer *fc2*
- *out_label* : the label needed by the loss
We can also specify the automatically generated names explicitly:
```{r}
data <- mx.symbol.Variable("data")
w <- mx.symbol.Variable("myweight")
net <- mx.symbol.FullyConnected(data=data, weight=w, name="fc1", num_hidden=128)
arguments(net)
```
### More Complicated Composition
MXNet provides well-optimized symbols for
commonly used layers in deep learning. We can also easily define new operators
in Python. The following example first performs an element-wise add between two
symbols, then feeds the result into the fully connected operator:
```{r}
lhs <- mx.symbol.Variable("data1")
rhs <- mx.symbol.Variable("data2")
net <- mx.symbol.FullyConnected(data=lhs + rhs, name="fc1", num_hidden=128)
arguments(net)
```
We can also construct a symbol in a more flexible way than the single
forward composition we used above:
```{r}
net <- mx.symbol.Variable("data")
net <- mx.symbol.FullyConnected(data=net, name="fc1", num_hidden=128)
net2 <- mx.symbol.Variable("data2")
net2 <- mx.symbol.FullyConnected(data=net2, name="net2", num_hidden=128)
composed.net <- mx.apply(net, data=net2, name="compose")
arguments(composed.net)
```
In the above example, *net* is used as a function to apply to the existing symbol
*net2*; in the resulting *composed.net*, the original argument *data* is replaced by
*net2*.
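To check that the composed graph pipes shapes through correctly, we can ask MXNet to infer them. The following is a minimal sketch; the input shape for *data2* (100 features with a batch size of 64, following R's column-major convention) is chosen purely for illustration:
```{r, eval=FALSE}
# Infer the shapes of all arguments and outputs of composed.net,
# given an illustrative shape for the free variable data2.
shapes <- mx.symbol.infer.shape(composed.net, data2 = c(100, 64))
shapes$arg.shapes   # shapes of the inputs, weights, and biases
shapes$out.shapes   # shape of the network output
```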
### Training a Neural Net
The [model API](../../R-package/R/model.R) is a thin wrapper around the symbolic executors to support neural net training.
You are also highly encouraged to read [Symbolic Configuration and Execution in Pictures for the Python package](../python/symbol_in_pictures.md),
which provides a detailed explanation of the concepts in pictures.
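As a hedged sketch of how the model API is typically used, suppose `mlp` is a symbol ending in `SoftmaxOutput`, such as the two-layer perceptron built at the start of this section, and that `train.x`, `train.y`, and `test.x` are hypothetical placeholders for your feature matrices and label vector; the hyper-parameter values below are illustrative only:
```{r, eval=FALSE}
# mlp: a symbol with a SoftmaxOutput head; train.x / train.y / test.x are
# hypothetical data placeholders, and all hyper-parameters are illustrative.
mx.set.seed(0)
model <- mx.model.FeedForward.create(
  mlp,
  X = train.x, y = train.y,
  ctx = mx.cpu(),              # or mx.gpu(0) if CUDA is available
  num.round = 10,              # number of training passes over the data
  array.batch.size = 100,
  learning.rate = 0.07,
  momentum = 0.9,
  eval.metric = mx.metric.accuracy
)
# Predictions come back as a class-by-sample probability matrix.
preds <- predict(model, test.x)
```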
### How Efficient Is the Symbolic API?
In short, it is designed to be very efficient in both memory and runtime.
The major reason for introducing the Symbolic API is to bring the efficient C++
operations of powerful toolkits such as cxxnet and caffe together with the
flexible dynamic NDArray operations. All the memory and computation resources are
allocated statically during bind, to maximize runtime performance and memory
utilization.
The coarse-grained operators are equivalent to cxxnet layers, which are
extremely efficient. We also provide fine-grained operators for more flexible
composition. Because we also do more in-place memory allocation, MXNet can
be ***more memory efficient*** than cxxnet, achieving the same runtime with
greater flexibility.