{"nbformat": 4, "cells": [{"source": "# Iterators - Loading data\nIn this tutorial, we focus on how to feed data into a training or inference program.\nMost training and inference modules in MXNet accept data iterators,\nwhich simplifies this procedure, especially when reading large datasets.\nHere we discuss the API conventions and several provided iterators.\n\n## Prerequisites\n\nTo complete this tutorial, we need: \n\n- MXNet. See the instructions for your operating system in [Setup and Installation](http://mxnet.io/get_started/install.html). \n\n- [OpenCV Python library](http://opencv.org/opencv-3-2.html), [Python Requests](http://docs.python-requests.org/en/master/), [Matplotlib](https://matplotlib.org/) and [Jupyter Notebook](http://jupyter.org/index.html).\n\n```\n$ pip install opencv-python requests matplotlib jupyter\n```\n- Set the environment variable `MXNET_HOME` to the root of the MXNet source folder. \n\n```\n$ git clone https://github.com/dmlc/mxnet ~/mxnet\n$ MXNET_HOME = '~/mxnet'\n```\n\n## MXNet Data Iterator \nData Iterators in *MXNet* are similar to Python iterator objects.\nIn Python, the function `iter` allows fetching items sequentially by calling `next()` on\n iterable objects such as a Python `list`.\nIterators provide an abstract interface for traversing various types of iterable collections\n without needing to expose details about the underlying data source.\n\nIn MXNet, data iterators return a batch of data as `DataBatch` on each call to `next`.\nA `DataBatch` often contains *n* training examples and their corresponding labels. Here *n* is the `batch_size` of the iterator. At the end of the data stream when there is no more data to read, the iterator raises ``StopIteration`` exception like Python `iter`. \nThe structure of `DataBatch` is defined [here](http://mxnet.io/api/python/io.html#mxnet.io.DataBatch).\n\nInformation such as name, shape, type and layout on each training example and their corresponding label can be provided as `DataDesc` data descriptor objects via the `provide_data` and `provide_label` properties in `DataBatch`.\nThe structure of `DataDesc` is defined [here](http://mxnet.io/api/python/io.html#mxnet.io.DataDesc).\n\nAll IO in MXNet is handled via `mx.io.DataIter` and its subclasses. 
In this tutorial, we'll discuss a few commonly used iterators provided by MXNet.\n\nBefore diving into the details, let's set up the environment by importing some required packages:", "cell_type": "markdown", "metadata": {}}, {"source": "import mxnet as mx\n%matplotlib inline\nimport os\nimport subprocess\nimport numpy as np\nimport matplotlib.pyplot as plt\nimport tarfile\n\nimport warnings\nwarnings.filterwarnings(\"ignore\", category=DeprecationWarning)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "## Reading data in memory\nWhen data is stored in memory, backed by either an `NDArray` or a ``numpy`` `ndarray`,\nwe can use [__`NDArrayIter`__](http://mxnet.io/api/python/io.html#mxnet.io.NDArrayIter) to read the data, as shown below:", "cell_type": "markdown", "metadata": {}}, {"source": "import numpy as np\ndata = np.random.rand(100,3)\nlabel = np.random.randint(0, 10, (100,))\ndata_iter = mx.io.NDArrayIter(data=data, label=label, batch_size=30)\nfor batch in data_iter:\n    print([batch.data, batch.label, batch.pad])", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "## Reading data from CSV files\nMXNet provides [`CSVIter`](http://mxnet.io/api/python/io.html#mxnet.io.CSVIter)\nto read from CSV files, and it can be used as shown below:", "cell_type": "markdown", "metadata": {}}, {"source": "# let's save `data` into a CSV file first and try reading it back\nnp.savetxt('data.csv', data, delimiter=',')\ndata_iter = mx.io.CSVIter(data_csv='data.csv', data_shape=(3,), batch_size=30)\nfor batch in data_iter:\n    print([batch.data, batch.pad])", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "## Custom Iterator\nWhen the built-in iterators do not suit your application needs,\nyou can create your own custom data iterator.\n\nAn iterator in _MXNet_ should:\n1. Implement `next()` in ``Python 2`` or `__next__()` in ``Python 3``,\n returning a `DataBatch` or raising a `StopIteration` exception if at the end of the data stream.\n2. Implement the `reset()` method to restart reading from the beginning.\n3. Have a `provide_data` attribute, consisting of a list of `DataDesc` objects that store the name, shape, type and layout information of the data (more info [here](http://mxnet.io/api/python/io.html#mxnet.io.DataDesc)).\n4. 
Have a `provide_label` attribute consisting of a list of `DataDesc` objects that store the name, shape, type and layout information of the label.\n\nWhen creating a new iterator, you can either start from scratch and define an iterator or reuse one of the existing iterators.\nFor example, in an image captioning application, the input example is an image while the label is a sentence.\nThus we can create a new iterator by:\n- creating an `image_iter` using `ImageRecordIter`, which provides multithreaded prefetching and augmentation.\n- creating a `caption_iter` using `NDArrayIter` or the bucketing iterator provided in the *rnn* package.\n- having `next()` return the combined result of `image_iter.next()` and `caption_iter.next()`.\n\nThe example below shows how to create a simple iterator.", "cell_type": "markdown", "metadata": {}}, {"source": "class SimpleIter(mx.io.DataIter):\n    def __init__(self, data_names, data_shapes, data_gen,\n                 label_names, label_shapes, label_gen, num_batches=10):\n        # wrap in list() so the descriptions can be iterated repeatedly (zip returns a one-shot iterator in Python 3)\n        self._provide_data = list(zip(data_names, data_shapes))\n        self._provide_label = list(zip(label_names, label_shapes))\n        self.num_batches = num_batches\n        self.data_gen = data_gen\n        self.label_gen = label_gen\n        self.cur_batch = 0\n\n    def __iter__(self):\n        return self\n\n    def reset(self):\n        self.cur_batch = 0\n\n    def __next__(self):\n        return self.next()\n\n    @property\n    def provide_data(self):\n        return self._provide_data\n\n    @property\n    def provide_label(self):\n        return self._provide_label\n\n    def next(self):\n        if self.cur_batch < self.num_batches:\n            self.cur_batch += 1\n            data = [mx.nd.array(g(d[1])) for d, g in zip(self._provide_data, self.data_gen)]\n            label = [mx.nd.array(g(d[1])) for d, g in zip(self._provide_label, self.label_gen)]\n            return mx.io.DataBatch(data, label)\n        else:\n            raise StopIteration", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "We can use the `SimpleIter` defined above to train a simple MLP, as shown below:", "cell_type": "markdown", "metadata": {}}, {"source": "import mxnet as mx\nnum_classes = 10\nnet = mx.sym.Variable('data')\nnet = mx.sym.FullyConnected(data=net, name='fc1', num_hidden=64)\nnet = mx.sym.Activation(data=net, name='relu1', act_type=\"relu\")\nnet = mx.sym.FullyConnected(data=net, name='fc2', num_hidden=num_classes)\nnet = mx.sym.SoftmaxOutput(data=net, name='softmax')\nprint(net.list_arguments())\nprint(net.list_outputs())", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "Here, four of the listed arguments are learnable parameters:\nthe *weights* and *biases* of the FullyConnected layers *fc1* and *fc2*.\nTwo arguments are input variables: *data* holds the training examples\nand *softmax_label* holds the respective labels; the symbol's output is *softmax_output*.\n\nThe input variables are called free variables in MXNet's Symbol API.\nTo execute a Symbol, they need to be bound with data.\n[Click here to learn more about Symbol](http://mxnet.io/tutorials/basic/symbol.html).\n\nWe use the data iterator to feed examples to a neural network via MXNet's `module` API.\n[Click here to learn more about Module](http://mxnet.io/tutorials/basic/module.html).", "cell_type": "markdown", "metadata": {}}, {"source": "import logging\nlogging.basicConfig(level=logging.INFO)\n\nn = 32\ndata_iter = SimpleIter(['data'], [(n, 100)],\n                       [lambda s: np.random.uniform(-1, 1, s)],\n                       ['softmax_label'], [(n,)],\n                       [lambda s: np.random.randint(0, num_classes, s)])\n\nmod = mx.mod.Module(symbol=net)\nmod.fit(data_iter, num_epoch=5)", "cell_type": "code", 
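"execution_count": null, "outputs": [], "metadata": {}}, {"source": "Since `SimpleIter` generates random data, the metrics reported above are only illustrative. The same iterator can also drive inference. Below is a minimal sketch (reusing the `mod` and `data_iter` objects defined above) that runs a forward pass over each batch and reads the softmax outputs:", "cell_type": "markdown", "metadata": {}}, {"source": "# a minimal sketch: reuse the fitted module for inference\ndata_iter.reset()                       # restart the iterator from the beginning\nfor batch in data_iter:\n    mod.forward(batch, is_train=False)  # forward pass only, no gradient computation\n    prob = mod.get_outputs()[0]         # softmax output for this batch\nprint(prob.shape)                       # (batch_size, num_classes)", "cell_type": "code",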
"execution_count": null, "outputs": [], "metadata": {}}, {"source": "## Record IO\nRecord IO is a file format used by MXNet for data IO.\nIt compactly packs the data for efficient read and writes from distributed file system like Hadoop HDFS and AWS S3.\nYou can learn more about the design of `RecordIO` [here](http://mxnet.io/architecture/note_data_loading.html).\n\nMXNet provides [__`MXRecordIO`__](http://mxnet.io/api/python/io.html#mxnet.recordio.MXRecordIO)\nand [__`MXIndexedRecordIO`__](http://mxnet.io/api/python/io.html#mxnet.recordio.MXIndexedRecordIO)\nfor sequential access of data and random access of the data.\n\n### MXRecordIO\nFirst, let's look at an example on how to read and write sequentially\nusing `MXRecordIO`. The files are named with a `.rec` extension.", "cell_type": "markdown", "metadata": {}}, {"source": "record = mx.recordio.MXRecordIO('tmp.rec', 'w')\nfor i in range(5):\n record.write('record_%d'%i)\nrecord.close()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "We can read the data back by opening the file with an option `r` as below:", "cell_type": "markdown", "metadata": {}}, {"source": "record = mx.recordio.MXRecordIO('tmp.rec', 'r')\nwhile True:\n item = record.read()\n if not item:\n break\n print (item)\nrecord.close()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "### MXIndexedRecordIO\n`MXIndexedRecordIO` supports random or indexed access to the data.\nWe will create an indexed record file and a corresponding index file as below:", "cell_type": "markdown", "metadata": {}}, {"source": "record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'w')\nfor i in range(5):\n record.write_idx(i, 'record_%d'%i)\nrecord.close()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "Now, we can access the individual records using the keys", "cell_type": "markdown", "metadata": {}}, {"source": "record = mx.recordio.MXIndexedRecordIO('tmp.idx', 'tmp.rec', 'r')\nrecord.read_idx(3)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "You can also list all the keys in the file.", "cell_type": "markdown", "metadata": {}}, {"source": "record.keys", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "### Packing and Unpacking data\n\nEach record in a .rec file can contain arbitrary binary data. However, most deep learning tasks require data to be input in label/data format.\nThe `mx.recordio` package provides a few utility functions for such operations, namely: `pack`, `unpack`, `pack_img`, and `unpack_img`.\n\n#### Packing/Unpacking Binary Data\n\n[__`pack`__](http://mxnet.io/api/python/io.html#mxnet.recordio.pack) and [__`unpack`__](http://mxnet.io/api/python/io.html#mxnet.recordio.unpack) are used for storing float (or 1d array of float) label and binary data. The data is packed along with a header. 
The header structure is defined [here](http://mxnet.io/api/python/io.html#mxnet.recordio.IRHeader).", "cell_type": "markdown", "metadata": {}}, {"source": "# pack\ndata = 'data'\nlabel1 = 1.0\nheader1 = mx.recordio.IRHeader(flag=0, label=label1, id=1, id2=0)\ns1 = mx.recordio.pack(header1, data)\n\nlabel2 = [1.0, 2.0, 3.0]\nheader2 = mx.recordio.IRHeader(flag=3, label=label2, id=2, id2=0)\ns2 = mx.recordio.pack(header2, data)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "# unpack\nprint(mx.recordio.unpack(s1))\nprint(mx.recordio.unpack(s2))", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "#### Packing/Unpacking Image Data\n\nMXNet provides [__`pack_img`__](http://mxnet.io/api/python/io.html#mxnet.recordio.pack_img) and [__`unpack_img`__](http://mxnet.io/api/python/io.html#mxnet.recordio.unpack_img) to pack/unpack image data.\nRecords packed by `pack_img` can be loaded by `mx.io.ImageRecordIter`.", "cell_type": "markdown", "metadata": {}}, {"source": "data = np.ones((3,3,1), dtype=np.uint8)\nlabel = 1.0\nheader = mx.recordio.IRHeader(flag=0, label=label, id=0, id2=0)\ns = mx.recordio.pack_img(header, data, quality=100, img_fmt='.jpg')", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "# unpack_img\nprint(mx.recordio.unpack_img(s))", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "#### Using tools/im2rec.py\nYou can also convert raw images into *RecordIO* format using the ``im2rec.py`` utility script provided in the MXNet [tools](https://github.com/dmlc/mxnet/tree/master/tools) folder.\nAn example of how to use the script for converting to *RecordIO* format is shown in the `Image IO` section below.\n\n## Image IO\n\nIn this section, we will learn how to preprocess and load image data in MXNet.\n\nThere are four ways of loading image data in MXNet:\n 1. Using [__mx.image.imdecode__](http://mxnet.io/api/python/io.html#mxnet.image.imdecode) to load raw image files.\n 2. Using [__`mx.image.ImageIter`__](http://mxnet.io/api/python/io.html#mxnet.image.ImageIter), implemented in Python, which is very flexible and easy to customize. It can read from .rec (`RecordIO`) files and raw image files.\n 3. Using [__`mx.io.ImageRecordIter`__](http://mxnet.io/api/python/io.html#mxnet.io.ImageRecordIter), implemented in C++ on the MXNet backend. This is less flexible but provides various language bindings.\n 4. Creating a custom iterator by inheriting from `mx.io.DataIter`.\n\n\n### Preprocessing Images\nImages can be preprocessed in different ways. We list some of them below:\n- Using `mx.io.ImageRecordIter`, which is fast but not very flexible. It is great for simple tasks like image recognition but won't work for more complex tasks like detection and segmentation.\n- Using `mx.recordio.unpack_img` (or `cv2.imread`, `skimage`, etc.) together with `numpy`, which is flexible but slow due to the Python Global Interpreter Lock (GIL).\n- Using the MXNet-provided `mx.image` package. 
It stores images in [__`NDArray`__](http://mxnet.io/tutorials/basic/ndarray.html) format and leverages MXNet's [dependency engine](http://mxnet.io/architecture/note_engine.html) to automatically parallelize processing and circumvent the GIL.\n\nBelow, we demonstrate some of the frequently used preprocessing routines provided by the `mx.image` package.\n\nLet's download sample images that we can work with.", "cell_type": "markdown", "metadata": {}}, {"source": "fname = mx.test_utils.download(url='http://data.mxnet.io/data/test_images.tar.gz', dirname='data', overwrite=False)\ntar = tarfile.open(fname)\ntar.extractall(path='./data')\ntar.close()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "#### Loading raw images\n`mx.image.imdecode` lets us load the images. `imdecode` provides a similar interface to ``OpenCV``.\n\n**Note:** You will still need ``OpenCV`` (not the `cv2` Python library) installed to use `mx.image.imdecode`.", "cell_type": "markdown", "metadata": {}}, {"source": "img = mx.image.imdecode(open('data/test_images/ILSVRC2012_val_00000001.JPEG', 'rb').read())\nplt.imshow(img.asnumpy()); plt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "#### Image Transformations", "cell_type": "markdown", "metadata": {}}, {"source": "# resize to w x h\ntmp = mx.image.imresize(img, 100, 70)\nplt.imshow(tmp.asnumpy()); plt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "# crop a random w x h region from the image\ntmp, coord = mx.image.random_crop(img, (150, 200))\nprint(coord)\nplt.imshow(tmp.asnumpy()); plt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "### Loading Data using Image Iterators\n\nBefore we see how to read data using the two built-in image iterators,\n let's get a sample __Caltech 101__ dataset\n that contains 101 object categories, and convert it into RecordIO format.\nDownload and unzip the dataset:", "cell_type": "markdown", "metadata": {}}, {"source": "fname = mx.test_utils.download(url='http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz', dirname='data', overwrite=False)\ntar = tarfile.open(fname)\ntar.extractall(path='./data')\ntar.close()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "Let's take a look at the data. As you can see, under the root folder (./data/101_ObjectCategories) every category has a subfolder (e.g., ./data/101_ObjectCategories/yin_yang).\n\nNow let's convert the images into RecordIO format using the `im2rec.py` utility script.\nFirst, we need to make a list that contains all the image files and their categories:", "cell_type": "markdown", "metadata": {}}, {"source": "MXNET_HOME = os.environ['MXNET_HOME']  # root of the MXNet source tree, set in the Prerequisites section\nos.system('python %s/tools/im2rec.py --list=1 --recursive=1 --shuffle=1 --test-ratio=0.2 data/caltech data/101_ObjectCategories' % MXNET_HOME)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "The resulting list file (./data/caltech_train.lst) is in the format `index\\t(one or more label)\\tpath`. 
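For example, one line of the generated list might look like the following (the index, label, and path values here are purely illustrative; the actual values depend on the dataset layout and the shuffle):\n\n```\n327\t11.000000\tbrain/image_0042.jpg\n```\n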
In this case, there is only one label for each image, but you can modify the list to add more labels for multi-label training.\n\nThen we can use this list to create our RecordIO file:", "cell_type": "markdown", "metadata": {}}, {"source": "os.system(\"python %s/tools/im2rec.py --num-thread=4 --pass-through=1 data/caltech data/101_ObjectCategories\" % MXNET_HOME)", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "The RecordIO files are now saved in the data folder (./data).\n\n#### Using ImageRecordIter\n[__`ImageRecordIter`__](http://mxnet.io/api/python/io.html#mxnet.io.ImageRecordIter) can be used for loading image data saved in RecordIO format. To use `ImageRecordIter`, simply create an instance by loading your record file:", "cell_type": "markdown", "metadata": {}}, {"source": "data_iter = mx.io.ImageRecordIter(\n    path_imgrec=\"./data/caltech.rec\",  # the target record file\n    data_shape=(3, 227, 227),  # output data shape; a 227x227 region will be cropped from the original image\n    batch_size=4,  # number of samples per batch\n    resize=256  # resize the shorter edge to 256 before cropping\n    # ... you can add more augmentation options as defined in ImageRecordIter\n    )\ndata_iter.reset()\nbatch = data_iter.next()\ndata = batch.data[0]\nfor i in range(4):\n    plt.subplot(1,4,i+1)\n    plt.imshow(data[i].asnumpy().astype(np.uint8).transpose((1,2,0)))\nplt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "#### Using ImageIter\n[__ImageIter__](http://mxnet.io/api/python/io.html#mxnet.image.ImageIter) is a flexible interface that supports loading images in both RecordIO and raw formats.", "cell_type": "markdown", "metadata": {}}, {"source": "data_iter = mx.image.ImageIter(batch_size=4, data_shape=(3, 227, 227),\n                               path_imgrec=\"./data/caltech.rec\",\n                               path_imgidx=\"./data/caltech.idx\")\ndata_iter.reset()\nbatch = data_iter.next()\ndata = batch.data[0]\nfor i in range(4):\n    plt.subplot(1,4,i+1)\n    plt.imshow(data[i].asnumpy().astype(np.uint8).transpose((1,2,0)))\nplt.show()", "cell_type": "code", "execution_count": null, "outputs": [], "metadata": {}}, {"source": "\n<!-- INSERT SOURCE DOWNLOAD BUTTONS -->\n\n", "cell_type": "markdown", "metadata": {}}], "metadata": {"display_name": "", "name": "", "language": "python"}, "nbformat_minor": 2}