<!--- Licensed to the Apache Software Foundation (ASF) under one -->
<!--- or more contributor license agreements. See the NOTICE file -->
<!--- distributed with this work for additional information -->
<!--- regarding copyright ownership. The ASF licenses this file -->
<!--- to you under the Apache License, Version 2.0 (the -->
<!--- "License"); you may not use this file except in compliance -->
<!--- with the License. You may obtain a copy of the License at -->
<!--- http://www.apache.org/licenses/LICENSE-2.0 -->
<!--- Unless required by applicable law or agreed to in writing, -->
<!--- software distributed under the License is distributed on an -->
<!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY -->
<!--- KIND, either express or implied. See the License for the -->
<!--- specific language governing permissions and limitations -->
<!--- under the License. -->
# Data Loading API
## Overview
This document summarizes the supported data formats and the iterator APIs used to
read data, covering the following modules:
```eval_rst
.. autosummary::
:nosignatures:
mxnet.io
mxnet.recordio
mxnet.image
```
First, let's see how to load data with an existing iterator.
The following iterator can be used to train a symbol whose input data variable is
named `data` and whose input label variable is named `softmax_label`.
The iterator also provides information about each batch, including the
shapes and names.
```python
>>> nd_iter = mx.io.NDArrayIter(data={'data':mx.nd.ones((100,10))},
... label={'softmax_label':mx.nd.ones((100,))},
... batch_size=25)
>>> print(nd_iter.provide_data)
[DataDesc[data,(25, 10L),<type 'numpy.float32'>,NCHW]]
>>> print(nd_iter.provide_label)
[DataDesc[softmax_label,(25,),<type 'numpy.float32'>,NCHW]]
```
Let's see a complete example of how to use a data iterator in model training.
```python
>>> data = mx.sym.Variable('data')
>>> label = mx.sym.Variable('softmax_label')
>>> fullc = mx.sym.FullyConnected(data=data, num_hidden=1)
>>> loss = mx.sym.SoftmaxOutput(data=fullc, label=label)
>>> mod = mx.mod.Module(loss, data_names=['data'], label_names=['softmax_label'])
>>> mod.bind(data_shapes=nd_iter.provide_data, label_shapes=nd_iter.provide_label)
>>> mod.fit(nd_iter, num_epoch=2)
```
A detailed tutorial is available at
[Iterators - Loading data](http://mxnet.io/tutorials/basic/data.html).
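The slicing that `NDArrayIter` performs above can be illustrated without MXNet. Below is a pure-Python sketch (the names `make_batches` and `data` are illustrative, not MXNet API) of how 100 samples with `batch_size=25` yield four batches:

```python
# Plain-Python sketch of fixed-size batching, analogous to NDArrayIter
# with 100 samples and batch_size=25 (illustrative only, not MXNet API).
def make_batches(samples, batch_size):
    """Yield consecutive fixed-size batches; drop any incomplete tail."""
    for start in range(0, len(samples) - batch_size + 1, batch_size):
        yield samples[start:start + batch_size]

data = [[1.0] * 10 for _ in range(100)]   # 100 samples, 10 features each
batches = list(make_batches(data, 25))
print(len(batches))      # 4 batches
print(len(batches[0]))   # 25 samples per batch
```

This mirrors the `(25, 10)` shape reported by `provide_data` in the example above: each batch holds 25 of the 100 samples.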
## Data iterators
```eval_rst
.. currentmodule:: mxnet
```
```eval_rst
.. autosummary::
:nosignatures:
io.NDArrayIter
io.CSVIter
io.LibSVMIter
io.ImageRecordIter
io.ImageRecordInt8Iter
io.ImageRecordUInt8Iter
io.MNISTIter
recordio.MXRecordIO
recordio.MXIndexedRecordIO
image.ImageIter
image.ImageDetIter
```
## Helper classes and functions
### Data structures and other iterators
```eval_rst
.. autosummary::
:nosignatures:
io.DataDesc
io.DataBatch
io.DataIter
io.ResizeIter
io.PrefetchingIter
io.MXDataIter
```
### Functions to read and write RecordIO files
```eval_rst
.. autosummary::
:nosignatures:
recordio.pack
recordio.unpack
recordio.unpack_img
recordio.pack_img
```
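The idea behind a record file can be illustrated with a plain-Python sketch of length-prefixed records. Note this mimics the *concept* only; MXNet's actual on-disk format adds magic numbers and an `IRHeader`, which `recordio.pack`/`recordio.unpack` handle for you. The helpers `write_records` and `read_records` below are illustrative, not MXNet API:

```python
import struct

# Illustrative length-prefixed records: each record is a 4-byte
# little-endian length followed by the payload bytes.
def write_records(records):
    buf = b''
    for payload in records:
        buf += struct.pack('<I', len(payload)) + payload
    return buf

def read_records(buf):
    records, pos = [], 0
    while pos < len(buf):
        (length,) = struct.unpack_from('<I', buf, pos)
        pos += 4
        records.append(buf[pos:pos + length])
        pos += length
    return records

blob = write_records([b'hello', b'record io'])
```

Because each record carries its own length, the file can be read back sequentially without any external index; `MXIndexedRecordIO` adds a separate index file on top of this idea for random access.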
## How to develop a new iterator
Writing a new data iterator in Python is straightforward. Most MXNet
training/inference programs accept an iterable object with ``provide_data``
and ``provide_label`` properties.
This [tutorial](http://mxnet.io/tutorials/basic/data.html) explains how to
write an iterator from scratch.
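As a minimal sketch of that contract in pure Python (plain `(name, shape)` tuples stand in for `mx.io.DataDesc`, and `next` returns raw lists instead of an `mx.io.DataBatch`, so the sketch runs without MXNet; all names here are illustrative):

```python
class ConstantIter:
    """Minimal sketch of the iterator contract: reset(), next(),
    provide_data and provide_label. Plain (name, shape) tuples stand
    in for mx.io.DataDesc so the sketch runs without MXNet."""
    def __init__(self, num_batches, batch_size, num_features):
        self.num_batches = num_batches
        self.batch_size = batch_size
        self.num_features = num_features
        self.cur = 0
    def reset(self):
        self.cur = 0
    def next(self):
        if self.cur >= self.num_batches:
            raise StopIteration
        self.cur += 1
        data = [[0.0] * self.num_features for _ in range(self.batch_size)]
        label = [0.0] * self.batch_size
        return data, label   # a real iterator returns mx.io.DataBatch
    @property
    def provide_data(self):
        return [('data', (self.batch_size, self.num_features))]
    @property
    def provide_label(self):
        return [('softmax_label', (self.batch_size,))]

it = ConstantIter(num_batches=4, batch_size=25, num_features=10)
```

A real iterator would additionally fill `data` and `label` with NDArrays read from disk, but the control flow (iterate until `StopIteration`, call `reset()` between epochs) is the same.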
The following example demonstrates how to combine
multiple data iterators into a single one. This can be used for multi-modality
training such as image captioning, in which images are read with
``ImageRecordIter`` while documents are read with ``CSVIter``.
```python
class MultiIter:
    """Combines several data iterators by concatenating their data
    and label lists batch-by-batch."""
    def __init__(self, iter_list):
        self.iters = iter_list
    def next(self):
        batches = [i.next() for i in self.iters]
        return mx.io.DataBatch(data=[d for b in batches for d in b.data],
                               label=[l for b in batches for l in b.label])
    def reset(self):
        for i in self.iters:
            i.reset()
    @property
    def provide_data(self):
        return [d for i in self.iters for d in i.provide_data]
    @property
    def provide_label(self):
        return [l for i in self.iters for l in i.provide_label]

iter = MultiIter([mx.io.ImageRecordIter('image.rec'), mx.io.CSVIter('txt.csv')])
```
Parsing and other pre-processing, such as augmentation, may be expensive.
If performance is critical, implement the data iterator in C++. Refer to
[src/io](https://github.com/dmlc/mxnet/tree/master/src/io) for examples.
### How to change the batch layout
By default, the backend engine treats the first dimension of each data and label variable in a data
iterator as the batch size (i.e. the `NCHW` or `NT` layout). To override the batch-size axis,
the `provide_data` (and `provide_label`, if labels are present) properties should include the layouts. This
is especially useful for RNNs, since `TNC` layouts are often more efficient. For example:
```python
@property
def provide_data(self):
return [DataDesc(name='seq_var', shape=(seq_length, batch_size), layout='TN')]
```
The backend engine will recognize the index of `N` in the `layout` as the axis for batch size.
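In other words, the batch axis is simply the position of `N` in the layout string. A minimal sketch of that lookup (`batch_axis` is an illustrative helper, not MXNet API; MXNet provides a similar utility via `DataDesc.get_batch_axis`):

```python
# Illustrative helper (not MXNet API): the batch axis is the index
# of 'N' in the layout string.
def batch_axis(layout):
    return layout.find('N')

print(batch_axis('NCHW'))  # 0: batch-major, as in image data
print(batch_axis('TN'))    # 1: time-major, as in the RNN example above
```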
## API Reference
<script type="text/javascript" src='../../../_static/js/auto_module_index.js'></script>
### mxnet.io - Data Iterators
```eval_rst
.. automodule:: mxnet.io
:noindex:
:members: NDArrayIter, CSVIter, LibSVMIter, ImageRecordIter, ImageRecordUInt8Iter, MNISTIter
```
### mxnet.io - Helper Classes & Functions
```eval_rst
.. automodule:: mxnet.io
:noindex:
:members: DataBatch, DataDesc, DataIter, MXDataIter, PrefetchingIter, ResizeIter
```
### mxnet.recordio
```eval_rst
.. currentmodule:: mxnet.recordio
.. automodule:: mxnet.recordio
:members:
```
```eval_rst
.. _name: mxnet.symbol.Symbol.name
.. _shape: mxnet.ndarray.NDArray.shape
```
<script>auto_index("api-reference");</script>