## Create a Dataset Using RecordIO
RecordIO implements a file format for a sequence of records. We recommend storing images as records and packing them together. The benefits include:
* Storing each record in a compact format, such as JPEG, greatly reduces the size of the dataset on disk.
* Packing records together allows sequential reads from disk.
* RecordIO files are simple to partition, which simplifies distributed training. We provide an example later.
We provide the [im2rec tool](https://github.com/dmlc/mxnet/blob/master/tools/im2rec.cc) so you can create an Image RecordIO dataset by yourself. The following walkthrough shows you how.
### Prerequisites
Download the data. You don't need to resize the images manually; `im2rec` can resize them automatically. For an example, see the `resize` option in the sample command in "Step 2. Create the Binary File," later in this topic.
### Step 1. Make an Image List File
After you download the data, you need to make an image list file. The format is:
```
integer_image_index \t label_index \t path_to_image
```
Typically, a script collects the names of all of the images, shuffles them, and then splits them into two lists: a training list and a testing list. Write each list in the format shown above, one image per line.
This is an example file:
```
95099 464 n04467665_17283.JPEG
10025081 412 ILSVRC2010_val_00025082.JPEG
74181 789 n01915811_2739.JPEG
10035553 859 ILSVRC2010_val_00035554.JPEG
10048727 929 ILSVRC2010_val_00048728.JPEG
94028 924 n01980166_4956.JPEG
1080682 650 n11807979_571.JPEG
972457 633 n07723039_1627.JPEG
7534 11 n01630670_4486.JPEG
1191261 249 n12407079_5106.JPEG
```
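The shuffle-and-split procedure described above can be sketched in Python as follows. This is a minimal illustration, not part of the MXNet tools: the function names (`list_images`, `write_image_list`, `make_lists`), the 90/10 split, and the assumption that images are stored in one subdirectory per class are all hypothetical choices made for the example.

```python
import os
import random


def list_images(root_dir, extensions=(".jpeg", ".jpg", ".png")):
    """Return (relative_path, label_index) pairs.

    Assumes one subdirectory per class under root_dir (a hypothetical layout).
    """
    class_names = sorted(
        d for d in os.listdir(root_dir)
        if os.path.isdir(os.path.join(root_dir, d))
    )
    samples = []
    for label_index, class_name in enumerate(class_names):
        class_dir = os.path.join(root_dir, class_name)
        for file_name in sorted(os.listdir(class_dir)):
            if file_name.lower().endswith(extensions):
                samples.append((os.path.join(class_name, file_name), label_index))
    return samples


def write_image_list(indexed_samples, list_path):
    """Write one 'index<TAB>label<TAB>path' line per sample."""
    with open(list_path, "w") as f:
        for image_index, (path, label) in indexed_samples:
            f.write("%d\t%d\t%s\n" % (image_index, label, path))


def make_lists(root_dir, train_path, test_path, train_ratio=0.9, seed=0):
    """Shuffle all images, then split them into a training and a testing list."""
    # enumerate() attaches a unique integer index to every image
    indexed_samples = list(enumerate(list_images(root_dir)))
    random.Random(seed).shuffle(indexed_samples)
    split = int(len(indexed_samples) * train_ratio)
    write_image_list(indexed_samples[:split], train_path)
    write_image_list(indexed_samples[split:], test_path)
    return split, len(indexed_samples) - split
```

For example, `make_lists("image_root_dir", "train.lst", "test.lst")` would write both list files and return the number of lines in each. The paths in each list are relative to the root directory, which matches how `im2rec` receives the image root as a separate argument.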
### Step 2. Create the Binary File
To generate the binary file, use `im2rec` in the tool folder. `im2rec` takes three inputs: the path of the _image list file_ you generated, the _root path_ of the images, and the _output file path_. For a large dataset, this process can take several hours, so be patient.
Sample command:
```bash
./bin/im2rec image.lst image_root_dir output.bin resize=256
```
For more details, run ```./bin/im2rec```.
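If you drive the conversion from a script, the sample command can be assembled and launched from Python. The sketch below is only an illustration: `build_im2rec_command` and `run_im2rec` are hypothetical helper names, the default binary location `./bin/im2rec` is taken from the sample command, and the only options assumed are the `key=value` arguments shown in this topic.

```python
import subprocess


def build_im2rec_command(list_path, image_root, output_path,
                         im2rec_path="./bin/im2rec", **options):
    """Build the argument list for one im2rec run.

    Extra keyword arguments become trailing key=value options,
    for example resize=256.
    """
    command = [im2rec_path, list_path, image_root, output_path]
    command += ["%s=%s" % (key, value) for key, value in options.items()]
    return command


def run_im2rec(*args, **kwargs):
    """Run im2rec; raises CalledProcessError on a non-zero exit status."""
    subprocess.check_call(build_im2rec_command(*args, **kwargs))
```

For example, `build_im2rec_command("image.lst", "image_root_dir", "output.bin", resize=256)` returns `['./bin/im2rec', 'image.lst', 'image_root_dir', 'output.bin', 'resize=256']`, which corresponds to the sample command above.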
### Extension: Multiple Labels for a Single Image
The `im2rec` tool and `mx.io.ImageRecordIter` have multi-label support for a single image.
For example, if each image has four labels, use the following procedure:
1. Write the image list files as follows:
```
integer_image_index \t label_1 \t label_2 \t label_3 \t label_4 \t path_to_image
```
2. Run `im2rec`, adding `label_width=4` to the command arguments, for example:
```bash
./bin/im2rec image.lst image_root_dir output.bin resize=256 label_width=4
```
3. When you create the iterator, set `label_width=4` and set `path_imglist` to the path of your `image.lst`, for example:
```python
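import mxnet as mx

# Iterator over the RecordIO file that yields four labels per image
dataiter = mx.io.ImageRecordIter(
    path_imgrec="data/cifar/train.rec",   # RecordIO file created by im2rec
    data_shape=(3, 28, 28),               # (channels, height, width)
    path_imglist="data/cifar/image.lst",  # image list file that holds the labels
    label_width=4                         # number of labels per image
)
```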
```
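The multi-label list format from step 1 can be written with a few lines of Python. This is a minimal sketch under stated assumptions: `format_multilabel_line` and `write_multilabel_list` are hypothetical helpers, and the input is assumed to be a sequence of `(labels, image_path)` pairs that you have already prepared.

```python
def format_multilabel_line(image_index, labels, image_path):
    """Return 'index<TAB>label_1<TAB>...<TAB>label_n<TAB>path' for one image."""
    fields = [str(image_index)] + [str(label) for label in labels] + [image_path]
    return "\t".join(fields)


def write_multilabel_list(records, list_path, label_width):
    """Write one line per (labels, image_path) record.

    Every record must carry exactly label_width labels, so that the file
    is consistent with the label_width value passed to im2rec and the iterator.
    """
    with open(list_path, "w") as f:
        for image_index, (labels, image_path) in enumerate(records):
            if len(labels) != label_width:
                raise ValueError(
                    "%s has %d labels, expected %d"
                    % (image_path, len(labels), label_width)
                )
            f.write(format_multilabel_line(image_index, labels, image_path) + "\n")
```

Checking the label count while writing catches a mismatch before the conversion starts, which matters because converting a large dataset can take a long time.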