In MXNet, NDArray is the basic operational unit for matrix and tensor computations. It is similar to numpy.ndarray, but it has two additional features: an NDArray can live on any device, such as a CPU or a GPU card, and NDArray operations are executed asynchronously, which lets MXNet automatically run them in parallel.
We can create an NDArray on either a CPU or a GPU:
>>> import mxnet as mx
>>> a = mx.nd.empty((2, 3))             # create a 2-by-3 matrix on cpu
>>> b = mx.nd.empty((2, 3), mx.gpu())   # create a 2-by-3 matrix on gpu 0
>>> c = mx.nd.empty((2, 3), mx.gpu(2))  # create a 2-by-3 matrix on gpu 2
>>> c.shape    # get shape
(2, 3)
>>> c.context  # get device info
gpu(2)
They can be initialized in various ways:
>>> a = mx.nd.zeros((2, 3))  # create a 2-by-3 matrix filled with 0
>>> b = mx.nd.ones((2, 3))   # create a 2-by-3 matrix filled with 1
>>> b[:] = 2                 # set all elements of b to 2
We can copy the value from one NDArray to another, even if they are located on different devices:
>>> a = mx.nd.ones((2, 3))
>>> b = mx.nd.zeros((2, 3), mx.gpu())
>>> a.copyto(b)  # copy data from cpu to gpu
We can also convert NDArray to numpy.ndarray:
>>> a = mx.nd.ones((2, 3))
>>> b = a.asnumpy()
>>> type(b)
<class 'numpy.ndarray'>
>>> print(b)
[[ 1.  1.  1.]
 [ 1.  1.  1.]]
and vice versa:
>>> import numpy as np
>>> a = mx.nd.empty((2, 3))
>>> a[:] = np.random.uniform(-0.1, 0.1, a.shape)
>>> print(a.asnumpy())
[[-0.06821112 -0.03704893  0.06688045]
 [ 0.09947646 -0.07700162  0.07681718]]
By default, NDArray performs element-wise operations:
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3)) * 4
>>> print(b.asnumpy())
[[ 4.  4.  4.]
 [ 4.  4.  4.]]
>>> c = a + b
>>> print(c.asnumpy())
[[ 6.  6.  6.]
 [ 6.  6.  6.]]
>>> d = a * b
>>> print(d.asnumpy())
[[ 8.  8.  8.]
 [ 8.  8.  8.]]
If two NDArrays are located on different devices, we need to explicitly move them onto the same device. The following example performs computations on GPU 0:
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3), mx.gpu()) * 3
>>> c = a.copyto(mx.gpu()) * b
>>> print(c.asnumpy())
[[ 6.  6.  6.]
 [ 6.  6.  6.]]
There are two easy ways to save data to (or load it from) disk. The first uses pickle. NDArray is pickle-compatible, which means you can pickle an NDArray just as you would a numpy.ndarray:
>>> import mxnet as mx
>>> import pickle as pkl
>>> a = mx.nd.ones((2, 3)) * 2
>>> data = pkl.dumps(a)
>>> b = pkl.loads(data)
>>> print(b.asnumpy())
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
The second way is to directly dump a list of NDArrays to disk in binary format:
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3)) * 3
>>> mx.nd.save('mydata.bin', [a, b])
>>> c = mx.nd.load('mydata.bin')
>>> print(c[0].asnumpy())
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
>>> print(c[1].asnumpy())
[[ 3.  3.  3.]
 [ 3.  3.  3.]]
We can also dump a dict:
>>> mx.nd.save('mydata.bin', {'a': a, 'b': b})
>>> c = mx.nd.load('mydata.bin')
>>> print(c['a'].asnumpy())
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
>>> print(c['b'].asnumpy())
[[ 3.  3.  3.]
 [ 3.  3.  3.]]
In addition, if a distributed file system such as Amazon S3 or HDFS has been set up, we can directly save to and load from it. For example:
>>> mx.nd.save('s3://mybucket/mydata.bin', [a, b])
>>> mx.nd.save('hdfs:///users/myname/mydata.bin', [a, b])
NDArray can automatically execute operations in parallel. This is desirable when multiple resources are available, such as CPU cards, GPU cards, and CPU-to-GPU memory bandwidth.

For example, if we write a += 1 followed by b += 1, and a is on a CPU card while b is on a GPU card, then we will want to execute them in parallel to improve efficiency. Furthermore, data copies between the CPU and GPU are expensive, so we want to run them in parallel with other computations as well.
However, finding statements that can be executed in parallel by eye is hard. In the following example, a += 1 and c *= 3 can be executed in parallel, but a += 1 and b *= 3 must be executed sequentially.
a = mx.nd.ones((2, 3))
b = a
c = a.copyto(mx.cpu())
a += 1
b *= 3
c *= 3
Luckily, MXNet can automatically resolve the dependencies and execute operations in parallel with correctness guaranteed. In other words, we can write a program as if it were using only a single thread, and MXNet will automatically dispatch it to multiple devices, such as multiple GPU cards or multiple computers.

MXNet achieves this by lazy evaluation. Any operation we write down is issued to an internal engine and then returns immediately. For example, running a += 1 returns right after pushing the plus operation to the engine. This asynchronism allows us to push more operations to the engine, so it can determine the read and write dependencies and find the best way to execute operations in parallel.
The actual computations are finished when we copy the results somewhere else, such as print(a.asnumpy()) or mx.nd.save('mydata.bin', [a]). Therefore, to write highly parallelized code, we only need to postpone asking for the results.