| # Iterators - Loading data |
| |
| In this tutorial we focus on how to feed data into a training and inference |
| program. Most training and inference modules in MXNet accepts data iterators, |
| which simplifies this procedure, especially when reading large datasets from |
| filesystems. Here we discuss the API conventions and several provided iterators. |
| |
| Data iterators in MXNet is similar to the iterator in Python. In Python, we can |
| use the built-in function `iter` with an iterable object (such as list) to |
| return an iterator. For example, in `x = iter([1, 2, 3])` we obtain an iterator |
| on the list `[1,2,3]`. If we repeatedly call `x.next()` (`__next__()` for Python |
| 3), then we will get elements from the list one by one, and end with a |
| `StopIteration` exception. |
| |
| ## Introduction |
| |
| ### Data Batch |
| |
| A data iterator returns a batch of data in each `next` call. |
| A batch often contains *n* examples and the according labels. Here *n* is |
| called as the batch size. |
| |
| The following codes define a simple data batch that is able to be read by most |
| training/inference modules. |
| |
| ```python |
| class SimpleBatch(object): |
| def __init__(self, data, label, pad=0): |
| self.data = data |
| self.label = label |
| self.pad = pad |
| ``` |
| |
| We explain what each attribute means: |
| |
| - `data` is a list of `NDArray`, each array contains *n* examples. For |
| instance, if an example is presented by a length `k` vector, then the shape of |
| the array will be `(n, k)`. |
| |
| Each array will be copied into a free variable such as created by |
| `mx.sym.Variable()` later. The mapping from arrays to free variables should be |
| given by the `provide_data` attribute of the iterator, which will be discussed |
| shortly. |
| |
| - `label` is also a list of `NDArray`. Often each array is a 1-dimensional |
| array with shape `(n,)`. For classification, each class is represented by an |
| integer starting from 0. |
| |
| - `pad` is an integer which shows the number of examples added in the last of the |
| batch that are merely used for padding. These examples should be ignored in |
| the results, such as computing the gradient. A nonzero padding is often used |
| when we reach the end of the data and the total number of examples cannot be |
| divided by the batch size. |
| |
| ### Data Variables |
| |
| Before showing the data iterator, we first discuss how to find free variables in |
| a symbol. A symbol often contains one or more explicit free variables and also |
| implicit ones. |
| |
| The following code defines a multilayer perceptron. |
| |
| ```python |
| import mxnet as mx |
| net = mx.sym.Variable('data') |
| net = mx.sym.FullyConnected(data=net, name='fc1', num_hidden=64) |
| net = mx.sym.Activation(data=net, name='relu1', act_type="relu") |
| net = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=10) |
| net = mx.sym.SoftmaxOutput(data=net, name='softmax') |
| ``` |
| |
| We can get the names of all the free variables by calling `list_arguments`: |
| |
| ```python |
| net.list_arguments() |
| ``` |
| |
| As can be seen, we name a variable either by its operator's name if it is atomic |
| (e.g. `Variable`) or by the `opname_varname` convention, where `opname` is the |
| operator's name and `varname` is assigned by the operator. The `varname` |
| often means what this variable is for: |
| |
| - `weight` : the weight parameters |
| - `bias` : the bias parameters |
| - `output` : the output |
| - `label` : input label |
| |
| On the above example, now we know that there are 6 variables for free |
| variables. Four of them are learnable parameters, `fc1_weight`, `fc1_bias`, |
| `fc2_weight`, and `fc2_bias`. These parameters are often initialized by |
| `mx.initializer` and updated by `mx.optimizer`. The rest two |
| are for input data: `data` for examples and `softmax_label` for the |
| according labels. Then it is the iterator's job to feed data into these two |
| variables. |
| |
| ### Data iterator |
| |
| An iterator in _MXNet_ should |
| |
| 1. return a data batch or raise a `StopIteration` exception if reaching the end |
| when call `next()` in python 2 or `__next()__` in python 3 |
| 2. has `reset()` method to restart reading from the beginning |
| 3. has `provide_data` and `provide_label` attributes, the former returns a list |
| of `(str, tuple)` pairs, each pair stores an input data variable name and its |
| shape. It is similar for `provide_label`, which provides information about |
| input labels. |
| |
| On the above example, assume the data batch size is *(n,k)* and label size is |
| *(n,1)*, the iterator for `net` should have `provide_data` to return |
| `[('data', (n,k))]` and `provide_label` to return |
| `[('softmax_label', (n,))]`. An training or inference program will then know how |
| to assign the arrays in a data batch into the input variables. |
| |
| ## Read array |
| |
| When data are already in memory and stored by either `NDArray` or numpy ndarray, |
| we can create an iterator by `NDArrayIter`: |
| |
| ```python |
| import numpy as np |
| data = np.random.rand(100,3) |
| label = np.random.randint(0, 10, (100,)) |
| data_iter = mx.io.NDArrayIter(data=data, label=label, batch_size=30) |
| for batch in data_iter: |
| print([batch.data, batch.label, batch.pad]) |
| ``` |
| |
| ## Read CSV |
| |
| There is an iterator called `CSVIter` to read data batches from CSV files. We |
| first dump `data` into a csv file, and then load the data. |
| |
| ```python |
| np.savetxt('data.csv', data, delimiter=',') |
| data_iter = mx.io.CSVIter(data_csv='data.csv', data_shape=(3,), batch_size=30) |
| for batch in data_iter: |
| print([batch.data, batch.pad]) |
| ``` |
| |
| Note that we have not given a label file, then `batch.label` is empty here. |
| |
| ## Read images |
| |
| <!-- TODO(mli) move notebooks here --> |
| |
| - [Read images](https://github.com/dmlc/mxnet-notebooks/blob/master/python/basic/image_io.ipynb) |
| - [Advanced image reading](https://github.com/dmlc/mxnet-notebooks/blob/master/python/basic/advanced_img_io.ipynb) |
| |
| ## Write your own data iterators |
| |
| Sometimes the provided iterators are not enough for some application. There are |
| mainly two ways to develop a new iterator. One is creating from scratch: the |
| following codes define an iterator that creates a given number of data batches |
| through a data generator `data_gen`. |
| |
| ```python |
| import numpy as np |
| class SimpleIter: |
| def __init__(self, data_names, data_shapes, data_gen, |
| label_names, label_shapes, label_gen, num_batches=10): |
| self._provide_data = zip(data_names, data_shapes) |
| self._provide_label = zip(label_names, label_shapes) |
| self.num_batches = num_batches |
| self.data_gen = data_gen |
| self.label_gen = label_gen |
| self.cur_batch = 0 |
| |
| def __iter__(self): |
| return self |
| |
| def reset(self): |
| self.cur_batch = 0 |
| |
| def __next__(self): |
| return self.next() |
| |
| @property |
| def provide_data(self): |
| return self._provide_data |
| |
| @property |
| def provide_label(self): |
| return self._provide_label |
| |
| def next(self): |
| if self.cur_batch < self.num_batches: |
| self.cur_batch += 1 |
| data = [mx.nd.array(g(d[1])) for d,g in zip(self._provide_data, self.data_gen)] |
| label = [mx.nd.array(g(d[1])) for d,g in zip(self._provide_label, self.label_gen)] |
| return SimpleBatch(data, label) |
| else: |
| raise StopIteration |
| ``` |
| |
| But in most cases we can reuse the existing data iterators. For example, in the |
| image caption application, an input example is an image while the label is a |
| sentence. Then we can create |
| |
| - `image_iter` by `ImageRecordIter` so we can enjoy the provided multithreaded |
| pre-fetch and augmentation. |
| - `caption_iter` by `NDArrayIter` or bucketing iterator provided in the `rnn` |
| package. |
| |
| Next we create an iterator whose `next()` function will both |
| `image_iter.next()` and `caption_iter.next()` and return the combined results. |
| |
| |
| <!-- INSERT SOURCE DOWNLOAD BUTTONS --> |