# NDArray: NumPy-style Tensor Computations on CPUs and GPUs
In MXNet, `NDArray` is the basic operational unit for matrix and tensor
computations. It is similar to `numpy.ndarray`, but it has two additional
features:
- Multiple device support: All operations can be run on various devices, including CPUs and GPUs.
- Automatic parallelization: Independent operations are automatically executed in parallel.
## Creation and Initialization
We can create an `NDArray` on either a CPU or a GPU:
```python
>>> import mxnet as mx
>>> a = mx.nd.empty((2, 3)) # create a 2-by-3 matrix on cpu
>>> b = mx.nd.empty((2, 3), mx.gpu()) # create a 2-by-3 matrix on gpu 0
>>> c = mx.nd.empty((2, 3), mx.gpu(2)) # create a 2-by-3 matrix on gpu 2
>>> c.shape # get shape
(2L, 3L)
>>> c.context # get device info
gpu(2)
```
They can be initialized in various ways:
```python
>>> a = mx.nd.zeros((2, 3)) # create a 2-by-3 matrix filled with 0
>>> b = mx.nd.ones((2, 3)) # create a 2-by-3 matrix filled with 1
>>> b[:] = 2 # set all elements of b to 2
```
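Beyond these constructors, an `NDArray` can also be created from an existing Python list or `numpy.ndarray` with `mx.nd.array` (a brief sketch; the optional device argument follows the same convention as above):
```python
>>> import numpy as np
>>> a = mx.nd.array([[1, 2], [3, 4]])              # create from a nested Python list
>>> b = mx.nd.array(np.arange(6).reshape((2, 3)))  # create from a numpy.ndarray
>>> print a.asnumpy()
[[ 1.  2.]
 [ 3.  4.]]
```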
We can copy the value from one `NDArray` to another, even if they are located on different devices:
```python
>>> a = mx.nd.ones((2, 3))
>>> b = mx.nd.zeros((2, 3), mx.gpu())
>>> a.copyto(b) # copy data from cpu to gpu
```
We can also convert `NDArray` to `numpy.ndarray`:
```python
>>> a = mx.nd.ones((2, 3))
>>> b = a.asnumpy()
>>> type(b)
<type 'numpy.ndarray'>
>>> print b
[[ 1. 1. 1.]
[ 1. 1. 1.]]
```
and vice versa:
```python
>>> import numpy as np
>>> a = mx.nd.empty((2, 3))
>>> a[:] = np.random.uniform(-0.1, 0.1, a.shape)
>>> print a.asnumpy()
[[-0.06821112 -0.03704893 0.06688045]
[ 0.09947646 -0.07700162 0.07681718]]
```
## Basic Element-wise Operations
By default, `NDArray` performs element-wise operations:
```python
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3)) * 4
>>> print b.asnumpy()
[[ 4. 4. 4.]
[ 4. 4. 4.]]
>>> c = a + b
>>> print c.asnumpy()
[[ 6. 6. 6.]
[ 6. 6. 6.]]
>>> d = a * b
>>> print d.asnumpy()
[[ 8. 8. 8.]
[ 8. 8. 8.]]
```
If two `NDArray`s are located on different devices, we need to explicitly move them onto the same device. The following example performs computations on GPU 0:
```python
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3), mx.gpu()) * 3
>>> c = a.copyto(mx.gpu()) * b
>>> print c.asnumpy()
[[ 6. 6. 6.]
[ 6. 6. 6.]]
```
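When an array might already live on the target device, `copyto` still performs a copy. A convenient alternative (sketched here, assuming the `as_in_context` method, which returns the array itself when it is already on the requested device) avoids the redundant copy:
```python
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = a.as_in_context(mx.gpu())  # copies to gpu 0 only if a is not already there
>>> print b.context
gpu(0)
```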
## Load and Save
There are two easy ways to save data to (and load it from) disk. The first uses
`pickle`. `NDArray` is pickle-compatible, which means that you can pickle an
`NDArray` just as you would a `numpy.ndarray`:
```python
>>> import mxnet as mx
>>> import pickle as pkl
>>> a = mx.nd.ones((2, 3)) * 2
>>> data = pkl.dumps(a)
>>> b = pkl.loads(data)
>>> print b.asnumpy()
[[ 2. 2. 2.]
[ 2. 2. 2.]]
```
The second way is to directly dump a list of `NDArray`s to disk in binary format:
```python
>>> a = mx.nd.ones((2, 3)) * 2
>>> b = mx.nd.ones((2, 3)) * 3
>>> mx.nd.save('mydata.bin', [a, b])
>>> c = mx.nd.load('mydata.bin')
>>> print c[0].asnumpy()
[[ 2. 2. 2.]
[ 2. 2. 2.]]
>>> print c[1].asnumpy()
[[ 3. 3. 3.]
[ 3. 3. 3.]]
```
We can also dump a dict:
```python
>>> mx.nd.save('mydata.bin', {'a':a, 'b':b})
>>> c = mx.nd.load('mydata.bin')
>>> print c['a'].asnumpy()
[[ 2. 2. 2.]
[ 2. 2. 2.]]
>>> print c['b'].asnumpy()
[[ 3. 3. 3.]
[ 3. 3. 3.]]
```
In addition, if we have set up distributed storage, such as Amazon S3 or HDFS, we
can save to and load from it directly. For example:
```python
>>> mx.nd.save('s3://mybucket/mydata.bin', [a, b])
>>> mx.nd.save('hdfs:///users/myname/mydata.bin', [a, b])
```
## Automatic Parallelization
`NDArray` can automatically execute operations in parallel. This is desirable when a
program uses multiple resources, such as CPUs, GPUs, and CPU-to-GPU memory bandwidth.
For example, if we write `a += 1` followed by `b += 1`, and `a` is on a CPU while
`b` is on a GPU, then we want the two statements to execute in parallel to improve
efficiency. Furthermore, data copies between CPU and GPU are expensive, so we
want to overlap them with other computations.
However, finding statements that can be executed in parallel by inspection is hard. In
the following example, `a += 1` and `c *= 3` can be executed in parallel, but `a += 1`
and `b *= 3` must be executed sequentially, because `b` refers to the same array as `a`.
```python
a = mx.nd.ones((2, 3))
b = a                     # b is an alias of a: both names refer to the same array
c = a.copyto(mx.cpu())    # c is an independent copy of a
a += 1
b *= 3                    # reads and writes the same data as `a += 1`: must run after it
c *= 3                    # touches neither a nor b: can run in parallel with `a += 1`
```
Luckily, MXNet can automatically resolve the dependencies and
execute operations in parallel while guaranteeing correctness. In other words, we
can write a program as if it were single-threaded, and MXNet will
automatically dispatch it to multiple devices, such as multiple GPU cards or multiple
computers.
MXNet achieves this by lazy evaluation. Any operation we write is issued to an
internal engine, and the call returns immediately. For example, running `a += 1`
returns right after the addition is pushed to the engine. This
asynchronism allows us to push more operations to the engine, so it can determine
the read and write dependencies and find the best way to execute operations in
parallel.
The actual computations finish when we ask for the results somewhere else, such as with `print a.asnumpy()` or `mx.nd.save('mydata.bin', [a])`. Therefore, to write highly parallelized code, we only need to postpone asking for
the results.
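As a rough illustration (a sketch, not part of the original example set; it assumes `mx.nd.dot` and the blocking `wait_to_read` method), issuing a chain of operations returns almost immediately, while most of the wall-clock time is spent in the blocking call:
```python
import time
a = mx.nd.ones((1000, 1000))
start = time.time()
for i in range(10):
    a = mx.nd.dot(a, a)   # returns as soon as the operation is pushed to the engine
print 'pushed in %.4f sec' % (time.time() - start)
a.wait_to_read()          # blocks until the pending computations on a have finished
print 'finished in %.4f sec' % (time.time() - start)
```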
## Next Steps
* [Symbol](symbol.md)
* [KVStore](kvstore.md)