layout: page_category
title: Using MXNet with Large Tensor Support
category: faq
faq_c: Extend and Contribute to MXNet
question: How do I use MXNet built with Large Tensor Support
permalink: /api/faq/large_tensor_support

Using MXNet with Large Tensor Support

What is large tensor support?

When creating a network that uses large amounts of data, as in a deep graph problem, you may need large tensor support. This means tensors are indexed using INT64, instead of INT32 indices.

This feature is enabled when MXNet is built with the flag USE_INT64_TENSOR_SIZE=1, which is now the default setting. You can make MXNet use INT32 indices instead by building with this flag set to 0.
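If you are not sure which build you are running, you can query MXNet's runtime feature flags. The following is a minimal sketch; it assumes the build flag is reported by mxnet.runtime under the name INT64_TENSOR_SIZE.

from mxnet.runtime import Features

# Features() lists the compile-time flags of the installed MXNet binary.
features = Features()
print(features.is_enabled('INT64_TENSOR_SIZE'))  # True when built with USE_INT64_TENSOR_SIZE=1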

When do you need it?

  1. When you are creating NDArrays with more than 2^31 elements.
  2. When the inputs to your model are tensors with more than 2^31 elements (for example, when you load them all at once in your code), or when operator attributes exceed 2^31 (a quick way to check is sketched below).
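If you are unsure whether your data crosses this threshold, you can compute the element count of a shape up front. This is a minimal sketch; the shape used here is purely illustrative.

import numpy as np

shape = (2_150_000, 1_000)          # hypothetical input shape
num_elements = int(np.prod(shape))  # total number of elements
# More than 2^31 - 1 elements means you need a build with large tensor support.
print(num_elements > 2**31 - 1)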

How to identify that you need large tensor support?

When you see one of the following errors:

  1. OverflowError: unsigned int is greater than maximum
  2. Check failed: inp->shape().Size() < 1 >> 31 (4300000000 vs. 0) : Size of tensor you are trying to allocate is larger than 2^32 elements. Please build with flag USE_INT64_TENSOR_SIZE=1
  3. Invalid Parameter format for end expect int or None but value='2150000000', in operator slice_axis(name="", end="2150000000", begin="0", axis="0"). Here the input attribute was expected to be an int32 (that is, less than 2^31); because the received value is larger than that, the operator's parameter inference treats it as a string, which is an unexpected input.

How to use it?

You can create a large NDArray that requires a build with large tensor support as follows:

import mxnet as mx
from mxnet import nd

LARGE_X = 4300000000
a = mx.nd.arange(0, LARGE_X, dtype="int64")
or
a = nd.ones(shape=LARGE_X)
or
a = nd.empty(LARGE_X)
or
a = nd.random.exponential(shape=LARGE_X)
or
a = nd.random.gamma(shape=LARGE_X)
or
a = nd.random.normal(shape=LARGE_X)

Caveats

  1. Use int64 as the dtype whenever you slice an NDArray and the index range goes beyond the maximum int32 value.
  2. Use int64 as the dtype when passing indices as parameters to operators, or when receiving indices as outputs from operators (see the sketch after this list).
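For example, an operator that consumes an index array, such as take, needs those indices stored with dtype int64 once any index can exceed the int32 range. This is a minimal sketch; the index values are illustrative.

import mxnet as mx
from mxnet import nd

LARGE_X = 4300000000
a = nd.arange(0, LARGE_X, dtype="int64")
# Indices beyond the int32 range must be held in an int64 array.
idx = nd.array([2**31 + 5, LARGE_X - 1], dtype="int64")
vals = nd.take(a, idx)
print(vals)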

The following are the cases for large tensor usage where you must specify dtype as int64:

  • randint():
low_large_value = 2**32
high_large_value = 2**34
# dtype is explicitly specified since default type is int32 for randint
a = nd.random.randint(low_large_value, high_large_value, dtype=np.int64)
  • ravel_multi_index() and unravel_index():
# rand_coord_2d is a helper used in MXNet's large tensor tests; it returns a random
# 2-D coordinate within the given row and column ranges.
x1, y1 = rand_coord_2d((LARGE_X - 100), LARGE_X, 10, SMALL_Y)
x2, y2 = rand_coord_2d((LARGE_X - 200), LARGE_X, 9, SMALL_Y)
x3, y3 = rand_coord_2d((LARGE_X - 300), LARGE_X, 8, SMALL_Y)
indices_2d = [[x1, x2, x3], [y1, y2, y3]]
# dtype is explicitly specified for the indices, otherwise they default to float32
idx = mx.nd.ravel_multi_index(mx.nd.array(indices_2d, dtype=np.int64),
                              shape=(LARGE_X, SMALL_Y))
indices_2d = mx.nd.unravel_index(mx.nd.array(idx.asnumpy(), dtype=np.int64),
                                 shape=(LARGE_X, SMALL_Y))
  • argsort() and topk()

Both return indices, which must be requested with dtype=np.int64.

# create_2d_tensor is a helper used in MXNet's large tensor tests
b = create_2d_tensor(rows=LARGE_X, columns=SMALL_Y)
# argsort
s = nd.argsort(b, axis=0, is_ascend=False, dtype=np.int64)
# topk
k = nd.topk(b, k=10, axis=0, dtype=np.int64)
  • index_copy()

Again, whenever indices are passed as arguments while working with large tensors, their dtype must be int64.

x = mx.nd.zeros((LARGE_X, SMALL_Y))
t = mx.nd.arange(1, SMALL_Y + 1).reshape((1, SMALL_Y))
# explicitly specifying dtype of indices to np.int64
index = mx.nd.array([LARGE_X - 1], dtype="int64")
x = mx.nd.contrib.index_copy(x, index, t)
  • one_hot()

Here the array again serves as indices: each entry gives the position in the large vector that should be set (activated).

# a is the index array here whose dtype should be int64.
a = nd.array([1, (VLARGE_X - 1)], dtype=np.int64)
b = nd.one_hot(a, VLARGE_X)

What platforms and versions of MXNet are supported?

You can use MXNet with large tensor support in the following configuration:

MXNet built for CPU on Linux (Ubuntu or Amazon Linux), and only for the Python bindings. Custom wheels are provided with this configuration.

These flavors of MXNet are currently built with large tensor support:

  1. MXNet for linux-cpu
  2. MXNet for linux_cu100

Large tensor support works only for the forward pass. The backward pass is partially supported and not completely tested, so it should be considered experimental at best.
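To illustrate the distinction, the sketch below runs a forward computation on a large NDArray and then attempts a backward pass; the backward call is the part that should be treated as experimental. (A tensor of this size needs tens of gigabytes of memory.)

from mxnet import nd, autograd

LARGE_X = 4300000000
x = nd.ones(shape=(LARGE_X,))
x.attach_grad()
with autograd.record():
    y = 2 * x      # forward pass: supported
y.backward()       # backward pass: partially supported, experimental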

Not supported:

  • GPU.
  • Windows, ARM or any operating system other than Ubuntu
  • Other language bindings like Scala, Java, R, and Julia.

Other known issues:

Symbolic reshape is not supported. Please see the following example.

a = mx.sym.Variable('a')
b = mx.sym.Variable('b')
c = 2 * a + b
texec = c.bind(mx.cpu(), {'a': nd.arange(0, LARGE_X * 2, dtype='int64').reshape(2, LARGE_X), 'b' : nd.arange(0, LARGE_X * 2, dtype='int64').reshape(2, LARGE_X)})
new_shape = {'a': (1, 2 * LARGE_X), 'b': (1, 2 * LARGE_X)}
texec.reshape(allow_up_sizing=True, **new_shape)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/mxnet/python/mxnet/executor.py", line 449, in reshape
    py_array('i', provided_arg_shape_data)),
OverflowError: signed integer is greater than maximum
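Until this is fixed, one possible workaround (a sketch, not an officially documented path; texec2 is just an illustrative name) is to skip Executor.reshape and instead bind the symbol again with NDArrays that already have the target shape:

# Bind a second executor with arrays of the new shape instead of calling reshape.
new_args = {'a': nd.arange(0, LARGE_X * 2, dtype='int64').reshape(1, 2 * LARGE_X),
            'b': nd.arange(0, LARGE_X * 2, dtype='int64').reshape(1, 2 * LARGE_X)}
texec2 = c.bind(mx.cpu(), new_args)
out = texec2.forward()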

Working DGL Example (dgl.ai)

The following sample code for DGL works with int64 tensor support but not with int32.

import mxnet as mx
from mxnet import gluon
import dgl
import dgl.function as fn
import numpy as np
from scipy import sparse as spsp

num_nodes = 10000000
num_edges = 100000000

col1 = np.random.randint(0, num_nodes, size=(num_edges,))
print('create col1')
col2 = np.random.randint(0, num_nodes, size=(num_edges,))
print('create col2')
data = np.ones((num_edges,))
print('create data')
spm = spsp.coo_matrix((data, (col1, col2)), shape=(num_nodes, num_nodes))
print('create coo')
labels = mx.nd.random.randint(0, 10, shape=(num_nodes,))

g = dgl.DGLGraph(spm, readonly=True)
print('create DGLGraph')
g.ndata['h'] = mx.nd.random.uniform(shape=(num_nodes, 200))
print('create node data')

class node_update(gluon.Block):
    def __init__(self, out_feats):
        super(node_update, self).__init__()
        self.dense = gluon.nn.Dense(out_feats, 'relu')
        self.dropout = 0.5

    def forward(self, nodes):
        h = mx.nd.concat(nodes.data['h'], nodes.data['accum'], dim=1)
        h = self.dense(h)
        return {'h': mx.nd.Dropout(h, p=self.dropout)}
update_fn = node_update(200)
update_fn.initialize(ctx=mx.cpu())

g.update_all(fn.copy_src(src='h', out='m'), fn.sum(msg='m', out='accum'), update_fn)
print('update all')

loss_fcn = gluon.loss.SoftmaxCELoss()
loss = loss_fcn(g.ndata['h'], labels)
print('loss')
loss = loss.sum()
print(loss)

Performance Regression:

Roughly 40 operators showed performance regressions in our preliminary analysis (Large Tensor Performance), as shown in the table below.

| Operator | int32 (msec) | int64 (msec) | int64/int32 | int32+mkl (msec) | int64+mkl (msec) | int64+mkl/int32+mkl |
|---|---|---|---|---|---|---|
| topk | 12.81245198 | 42.2472195 | 329.74% | 12.728027 | 43.462353 | 341.47% |
| argsort | 16.43896801 | 46.2231455 | 281.18% | 17.200311 | 46.7779985 | 271.96% |
| sort | 16.57822751 | 46.5644815 | 280.88% | 16.401236 | 46.263803 | 282.08% |
| flip | 0.221817521 | 0.535838 | 241.57% | 0.2123705 | 0.7950055 | 374.35% |
| depth_to_space | 0.250976998 | 0.534083 | 212.80% | 0.2338155 | 0.631252 | 269.98% |
| space_to_depth | 0.254336512 | 0.5368935 | 211.10% | 0.2334405 | 0.6343175 | 271.73% |
| min_axis | 0.685826526 | 1.4393255 | 209.87% | 0.6266175 | 1.3538925 | 216.06% |
| sum_axis | 0.720809505 | 1.5110635 | 209.63% | 0.6566265 | 0.8290575 | 126.26% |
| nansum | 1.279337012 | 2.635434 | 206.00% | 1.227156 | 2.4305255 | 198.06% |
| argmax | 4.765146994 | 9.682672 | 203.20% | 4.6576605 | 9.394067 | 201.69% |
| swapaxes | 0.667943008 | 1.3544455 | 202.78% | 0.649036 | 1.8293235 | 281.85% |
| argmin | 4.774890491 | 9.545651 | 199.91% | 4.666858 | 9.5194385 | 203.98% |
| sum_axis | 0.540210982 | 1.0550705 | 195.31% | 0.500895 | 0.616179 | 123.02% |
| max_axis | 0.117824005 | 0.226481 | 192.22% | 0.149085 | 0.224334 | 150.47% |
| argmax_channel | 0.261897018 | 0.49573 | 189.28% | 0.251171 | 0.4814885 | 191.70% |
| min_axis | 0.147698505 | 0.2675355 | 181.14% | 0.148424 | 0.2874105 | 193.64% |
| nansum | 1.142132009 | 2.058077 | 180.20% | 1.042387 | 1.263102 | 121.17% |
| min_axis | 0.56951947 | 1.020972 | 179.27% | 0.4722595 | 0.998179 | 211.36% |
| min | 1.154684491 | 2.0446045 | 177.07% | 1.0534145 | 1.9723065 | 187.23% |
| sum | 1.121753477 | 1.959272 | 174.66% | 0.9984095 | 1.213339 | 121.53% |
| sum_axis | 0.158632494 | 0.2744115 | 172.99% | 0.1573735 | 0.2266315 | 144.01% |
| nansum | 0.21418152 | 0.3661335 | 170.95% | 0.2162935 | 0.269517 | 124.61% |
| random_normal | 1.229072484 | 2.093057 | 170.30% | 1.222785 | 2.095916 | 171.41% |
| LeakyReLU | 0.344101485 | 0.582337 | 169.23% | 0.389167 | 0.7003465 | 179.96% |
| nanprod | 1.273265516 | 2.095068 | 164.54% | 1.0906815 | 2.054369 | 188.36% |
| nanprod | 0.203272473 | 0.32792 | 161.32% | 0.202548 | 0.3288335 | 162.35% |
| sample_gamma | 8.079962019 | 12.7266385 | 157.51% | 12.4216245 | 12.7957475 | 103.01% |
| sum | 0.21571602 | 0.3396875 | 157.47% | 0.1939995 | 0.262942 | 135.54% |
| argmin | 0.086381478 | 0.1354795 | 156.84% | 0.0826235 | 0.134886 | 163.25% |
| argmax | 0.08664903 | 0.135826 | 156.75% | 0.082693 | 0.1269225 | 153.49% |
| sample_gamma | 7.712843508 | 12.0266355 | 155.93% | 11.8900915 | 12.143009 | 102.13% |
| sample_exponential | 2.312778 | 3.5953945 | 155.46% | 3.0935085 | 3.5656265 | 115.26% |
| prod | 0.203170988 | 0.3113865 | 153.26% | 0.180757 | 0.264523 | 146.34% |
| random_uniform | 0.40893798 | 0.6240795 | 152.61% | 0.244613 | 0.6319695 | 258.35% |
| min | 0.205482502 | 0.3122025 | 151.94% | 0.2023835 | 0.33234 | 164.21% |
| random_negative_binomial | 3.919228504 | 5.919488 | 151.04% | 5.685851 | 6.0220735 | 105.91% |
| max | 0.212521001 | 0.3130105 | 147.28% | 0.2039755 | 0.2956105 | 144.92% |
| LeakyReLU | 2.813424013 | 4.1121625 | 146.16% | 2.719118 | 5.613753 | 206.45% |
| mean | 0.242281501 | 0.344385 | 142.14% | 0.209396 | 0.313411 | 149.67% |
| Deconvolution | 7.43279251 | 10.4240845 | 140.24% | 2.9548925 | 5.812926 | 196.72% |
| abs | 0.273286481 | 0.38319 | 140.22% | 0.3711615 | 0.338064 | 91.08% |
| arcsinh | 0.155792513 | 0.2090985 | 134.22% | 0.113365 | 0.1702855 | 150.21% |
| sample_gamma | 0.137634983 | 0.1842455 | 133.87% | 0.1792825 | 0.172175 | 96.04% |
| sort | 0.864107016 | 1.1560165 | 133.78% | 0.8239285 | 1.1454645 | 139.02% |
| argsort | 0.847259507 | 1.1320885 | 133.62% | 0.842302 | 1.1179105 | 132.72% |
| cosh | 0.129947497 | 0.1727415 | 132.93% | 0.1192565 | 0.1217325 | 102.08% |
| random_randint | 0.822044531 | 1.085645 | 132.07% | 0.6036805 | 1.0953995 | 181.45% |
| arctanh | 0.119817996 | 0.1576315 | 131.56% | 0.115616 | 0.111907 | 96.79% |
| arccos | 0.185662502 | 0.2423095 | 130.51% | 0.238534 | 0.2351415 | 98.58% |
| mean | 1.758513477 | 2.2908485 | 130.27% | 1.5868465 | 2.530801 | 159.49% |
| erfinv | 0.142498524 | 0.184796 | 129.68% | 0.1529025 | 0.1538225 | 100.60% |
| degrees | 0.12517249 | 0.1576175 | 125.92% | 0.1166425 | 0.1199775 | 102.86% |
| sample_exponential | 0.07651851 | 0.0960485 | 125.52% | 0.0885775 | 0.095597 | 107.92% |
| arctan | 0.120863522 | 0.1496115 | 123.79% | 0.1161245 | 0.17206 | 148.17% |
| prod | 1.147695002 | 1.408007 | 122.68% | 1.0491025 | 1.4065515 | 134.07% |
| fix | 0.073436997 | 0.089991 | 122.54% | 0.0390455 | 0.099307 | 254.34% |
| exp | 0.047701993 | 0.058272 | 122.16% | 0.0397295 | 0.0506725 | 127.54% |
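If you want to gauge how much a particular operator regresses on your own workload, a rough wall-clock comparison between an int32 and an int64 build can be done as in the sketch below. This is not the harness used to produce the table above; mx.nd.waitall() is needed because MXNet executes operators asynchronously, and the operator, shape, and trial count are illustrative.

import time
import mxnet as mx

x = mx.nd.random.uniform(shape=(1000, 1000))
mx.nd.waitall()                          # finish setup work before timing

n_trials = 1000
start = time.time()
for _ in range(n_trials):
    y = mx.nd.topk(x, k=10, axis=0)
mx.nd.waitall()                          # wait for all asynchronous kernels to complete
elapsed_ms = (time.time() - start) * 1000
print('topk: %.3f msec per call' % (elapsed_ms / n_trials))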