Data Loading API

Overview

This document summarizes the supported data formats and the iterator APIs for reading data, including:

.. autosummary::
    :nosignatures:

    mxnet.io
    mxnet.recordio
    mxnet.image

First, let's see how to create an iterator from in-memory arrays. The following iterator can be used to train a symbol whose input data variable is named data and whose input label variable is named softmax_label. The iterator also provides information about each batch, including the shapes and names.

>>> import mxnet as mx
>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
...                             label={'softmax_label':mx.nd.ones((100,))},
...                             batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),<type 'numpy.float32'>,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),<type 'numpy.float32'>,NCHW]]

Let's see a complete example of how to use a data iterator in model training.

>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)

A detailed tutorial is available at Iterators - Loading data.

Data iterators

.. currentmodule:: mxnet

.. autosummary::
    :nosignatures:

    io.NDArrayIter
    io.CSVIter
    io.ImageRecordIter
    io.ImageRecordUInt8Iter
    io.MNISTIter
    recordio.MXRecordIO
    recordio.MXIndexedRecordIO
    image.ImageIter

Helper classes and functions

Data structures and other iterators provided in the mxnet.io package.

.. autosummary::
    :nosignatures:

    io.DataDesc
    io.DataBatch
    io.DataIter
    io.ResizeIter
    io.PrefetchingIter
    io.MXDataIter

A list of image modification functions provided by mxnet.image.

.. autosummary::
    :nosignatures:

    image.imdecode
    image.scale_down
    image.resize_short
    image.fixed_crop
    image.random_crop
    image.center_crop
    image.color_normalize
    image.random_size_crop
    image.ResizeAug
    image.RandomCropAug
    image.RandomSizedCropAug
    image.CenterCropAug
    image.RandomOrderAug
    image.ColorJitterAug
    image.LightingAug
    image.ColorNormalizeAug
    image.HorizontalFlipAug
    image.CastAug
    image.CreateAugmenter

Functions to read and write RecordIO files.

.. autosummary::
    :nosignatures:

    recordio.pack
    recordio.unpack
    recordio.unpack_img
    recordio.pack_img

Develop a new iterator

Writing a new data iterator in Python is straightforward. Most MXNet training/inference programs accept an iterable object with provide_data and provide_label properties. This tutorial explains how to write an iterator from scratch.
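As a rough illustration of that interface, the sketch below implements a tiny iterator in plain Python. The names Desc, Batch, and RangeIter are hypothetical stand-ins invented for this example so that it runs without MXNet; a real iterator would return mx.io.DataDesc and mx.io.DataBatch objects holding NDArrays instead.

```python
from collections import namedtuple

# Hypothetical stand-ins for mx.io.DataDesc and mx.io.DataBatch (assumptions
# made for this sketch only; they are not MXNet classes).
Desc = namedtuple("Desc", ["name", "shape"])
Batch = namedtuple("Batch", ["data", "label"])


class RangeIter:
    """Yield fixed-size batches of consecutive integers with zero labels."""

    def __init__(self, num_samples, batch_size):
        self.num_samples = num_samples
        self.batch_size = batch_size
        self.cursor = 0

    @property
    def provide_data(self):
        # One entry per input variable: its name and per-batch shape.
        return [Desc("data", (self.batch_size,))]

    @property
    def provide_label(self):
        return [Desc("softmax_label", (self.batch_size,))]

    def reset(self):
        # Rewind so that the next epoch starts from the first sample.
        self.cursor = 0

    def __iter__(self):
        return self

    def __next__(self):
        # Drop a trailing partial batch and signal the end of the epoch.
        if self.cursor + self.batch_size > self.num_samples:
            raise StopIteration
        start = self.cursor
        self.cursor += self.batch_size
        values = list(range(start, self.cursor))
        return Batch(data=[values], label=[[0] * self.batch_size])

    next = __next__  # same method name as used elsewhere in this document


it = RangeIter(num_samples=10, batch_size=4)
first_epoch = [batch.data[0] for batch in it]
it.reset()
second_epoch = [batch.data[0] for batch in it]
print(first_epoch)   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Calling reset between epochs is what lets the same object be reused; without it the second pass would yield no batches.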

The following example demonstrates how to combine multiple data iterators into a single one. It can be used for multimodal training such as image captioning, in which images are read by ImageRecordIter while documents are read by CSVIter.

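import mxnet as mx
from mxnet.io import DataBatch

class MultiIter:
    """Combine several data iterators into a single iterator."""
    def __init__(self, iter_list):
        self.iters = iter_list
    def __iter__(self):
        return self
    def next(self):
        # A StopIteration raised by any sub-iterator ends the epoch.
        batches = [i.next() for i in self.iters]
        # Flatten the per-iterator lists into one list, in iterator order.
        return DataBatch(data=[d for b in batches for d in b.data],
                         label=[l for b in batches for l in b.label])
    __next__ = next  # Python 3 iterator protocol
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        return [d for i in self.iters for d in i.provide_data]
    @property
    def provide_label(self):
        return [l for i in self.iters for l in i.provide_label]

# The shapes and batch size below are placeholders; adjust them to your data.
multi_iter = MultiIter([
    mx.io.ImageRecordIter(path_imgrec='image.rec', data_shape=(3, 224, 224),
                          batch_size=32),
    mx.io.CSVIter(data_csv='txt.csv', data_shape=(10,), batch_size=32)])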
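The merging step can be checked in isolation. The following sketch uses a hypothetical Batch namedtuple in place of DataBatch, and strings in place of arrays, to show that the combined batch lists each iterator's data and labels in the order the iterators were given. The helper name merge_batches is an assumption made for this illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for mx.io.DataBatch, used only for this illustration.
Batch = namedtuple("Batch", ["data", "label"])


def merge_batches(batches):
    """Concatenate the data and label lists of several batches, in order."""
    return Batch(data=[d for b in batches for d in b.data],
                 label=[l for b in batches for l in b.label])


# Strings stand in for the arrays a real iterator would return.
image_batch = Batch(data=["image_array"], label=["image_label"])
text_batch = Batch(data=["text_array"], label=["text_label"])

merged = merge_batches([image_batch, text_batch])
print(merged.data)   # ['image_array', 'text_array']
print(merged.label)  # ['image_label', 'text_label']
```

Because the order is preserved, the entries of provide_data and provide_label line up with the entries of each merged batch.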

Parsing and other pre-processing steps, such as augmentation, can be expensive. If performance is critical, we can implement a data iterator in C++. Refer to src/io for examples.

Change batch layout

By default, the backend engine treats the first dimension of each data and label variable in data iterators as the batch size (e.g., NCHW or NT layout). To override the axis used for the batch size, the provide_data property (and provide_label, if there are labels) should include the layouts. This is especially useful for RNNs, since TNC layouts are often more efficient. For example:

@property
def provide_data(self):
    return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]

The backend engine will recognize the index of N in the layout as the axis for batch size.
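The lookup rule can be sketched in a few lines of plain Python. The helpers batch_axis and batch_size_of are hypothetical names written for this illustration, and the fallback to axis 0 when no layout is given is an assumption based on the default described above; this is not MXNet's own implementation.

```python
def batch_axis(layout):
    """Return the index of the batch dimension 'N' in a layout string."""
    if layout is None:
        # Assumed default: the first dimension is the batch dimension.
        return 0
    axis = layout.find("N")
    if axis < 0:
        raise ValueError("layout %r has no batch dimension 'N'" % layout)
    return axis


def batch_size_of(shape, layout):
    """Read the batch size from a shape, given its layout string."""
    return shape[batch_axis(layout)]


print(batch_axis("NCHW"))             # 0
print(batch_axis("TN"))               # 1
print(batch_size_of((35, 32), "TN"))  # 32
```

With layout TN and shape (seq_length, batch_size), the batch size is therefore read from the second dimension rather than the first.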

API Reference

.. automodule:: mxnet.io
    :members:
.. automodule:: mxnet.image
    :members:
.. automodule:: mxnet.recordio
    :members: