| # Data Preparation |
| |
| --- |
| |
| SINGA uses input layers to load data. |
| Users can store their data in any format (e.g., CSV or binary) and in any location |
| (e.g., a disk file or HDFS), as long as corresponding input layers exist that |
| can read and parse the data records. |
| |
| To make this easier, SINGA provides a [StoreInputLayer] that reads data |
| as (string:key, string:value) tuples from a couple of sources. |
| These sources are abstracted by a [Store]() class, which is a simplified version of |
| the DB abstraction in Caffe. The base Store class provides the following operations |
| for reading and writing tuples: |
| |
| Open(string path, Mode mode); // open the store for kRead or kCreate or kAppend |
| Close(); |
| |
| Read(string* key, string* val); // read a tuple; returns false on failure |
| Write(string key, string val); // write a tuple |
| Flush(); |
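
To make the open/write/read cycle concrete, the sketch below mirrors these operations with a purely in-memory class. `InMemoryStore` and its internals are hypothetical and are not part of SINGA; the `path` argument is accepted only to match the signature above and is ignored.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration only; this is not SINGA's Store class.
enum Mode { kRead, kCreate, kAppend };

class InMemoryStore {
 public:
  // kCreate discards existing tuples, kAppend keeps them,
  // and kRead rewinds the read cursor to the first tuple.
  bool Open(const std::string& /*path*/, Mode mode) {
    mode_ = mode;
    cursor_ = 0;
    if (mode == kCreate) tuples_.clear();
    open_ = true;
    return true;
  }

  void Close() { open_ = false; }

  // Reads the next tuple; returns false if the store is closed,
  // was not opened with kRead, or has no tuples left.
  bool Read(std::string* key, std::string* val) {
    if (!open_ || mode_ != kRead || cursor_ >= tuples_.size()) return false;
    *key = tuples_[cursor_].first;
    *val = tuples_[cursor_].second;
    ++cursor_;
    return true;
  }

  // Appends a tuple; returns false if the store is closed or opened with kRead.
  bool Write(const std::string& key, const std::string& val) {
    if (!open_ || mode_ == kRead) return false;
    tuples_.emplace_back(key, val);
    return true;
  }

  void Flush() {}  // Nothing is buffered in this in-memory sketch.

 private:
  std::vector<std::pair<std::string, std::string>> tuples_;
  std::size_t cursor_ = 0;
  Mode mode_ = kRead;
  bool open_ = false;
};
```

A caller would open the store with `kCreate`, write tuples, then flush and close it, and later reopen it with `kRead` and call `Read` until it returns false.
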
| |
| Currently, two implementations are provided: |
| |
| 1. [KVFileStore] for storing tuples in [KVFile]() (a binary file). |
| The *create_data.cc* files in *examples/cifar10* and *examples/mnist* provide |
| examples of storing records using KVFileStore. |
| |
| 2. [TextFileStore] for storing tuples in a plain text file (one line per tuple). |
| |
| The (key, value) tuples are parsed by subclasses of StoreInputLayer, depending on the |
| format of the tuple: |
| |
| * [ProtoRecordInputLayer] parses the value field from one |
| tuple into a [SingleLabelImageRecord], which is generated by Google Protobuf according |
| to [common.proto]. It can be used to store features for images (e.g., using the pixel field) |
| or other objects (using the data field). The key field is not used. |
| |
| * [CSVRecordInputLayer] parses one tuple as a CSV line (comma-separated fields). |
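
As a rough illustration of what parsing a CSV tuple involves, the sketch below splits one comma-separated line into floating-point fields. `ParseCsvLine` is a hypothetical helper, not the code inside CSVRecordInputLayer, and it assumes every field is numeric.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits one CSV line on commas and converts each
// field to float. std::stof throws if a field is empty or not numeric.
std::vector<float> ParseCsvLine(const std::string& line) {
  std::vector<float> fields;
  std::stringstream stream(line);
  std::string field;
  while (std::getline(stream, field, ',')) {
    fields.push_back(std::stof(field));
  }
  return fields;
}
```
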
| |
| |
| ## Using built-in record format |
| |
| SingleLabelImageRecord is a built-in record in SINGA for storing image features. |
| It is used in the cifar10 and mnist examples. |
| |
| message SingleLabelImageRecord { |
| repeated int32 shape = 1; // e.g., 3 (RGB channels), 32 (rows), 32 (columns) |
| optional int32 label = 2; // label |
| optional bytes pixel = 3; // pixels |
| repeated float data = 4 [packed = true]; // it is used for normalization |
| } |
| |
| This section walks through data preparation for the [CIFAR-10 image dataset](http://www.cs.toronto.edu/~kriz/cifar.html). |
| The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class. |
| There are 50,000 training images and 10,000 test images. |
| Each image has a single label. The dataset is stored in binary files with a specific format. |
| SINGA ships with [create_data.cc](https://github.com/apache/incubator-singa/blob/master/examples/cifar10/create_data.cc), |
| which converts the images in the binary files into `SingleLabelImageRecord`s and inserts them into training and test stores. |
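
For orientation, each entry in the CIFAR-10 binary files is one label byte followed by 3 x 32 x 32 = 3072 pixel bytes. The sketch below extracts one such entry from a buffer that already holds a file's contents. `RawImage` and `ParseCifarRecord` are hypothetical names, and this is not the code in create_data.cc.

```cpp
#include <cstddef>
#include <string>

constexpr std::size_t kImageBytes = 3 * 32 * 32;     // channels * rows * columns
constexpr std::size_t kRecordBytes = 1 + kImageBytes;  // label byte + pixels

// Hypothetical plain struct holding the fields a record would be filled from.
struct RawImage {
  int label = 0;
  std::string pixel;  // raw pixel bytes
};

// Copies the index-th entry out of `buffer`; returns false if the buffer
// does not contain a complete entry at that position.
bool ParseCifarRecord(const std::string& buffer, std::size_t index,
                      RawImage* out) {
  const std::size_t offset = index * kRecordBytes;
  if (offset + kRecordBytes > buffer.size()) return false;
  out->label = static_cast<unsigned char>(buffer[offset]);
  out->pixel.assign(buffer, offset + 1, kImageBytes);
  return true;
}
```

The extracted label and pixel bytes correspond to the `label` and `pixel` fields of the record shown above.
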
| |
| 1. Download the raw data. The following commands download the dataset into the *cifar-10-batches-bin* folder. |
| |
| # in SINGA_ROOT/examples/cifar10 |
| $ cp Makefile.example Makefile  # an example makefile is provided |
| $ make download |
| |
| 2. Fill one record for each image, and insert it into the store. |
| |
| KVFileStore store; |
| store.Open(output_file_path, singa::io::kCreate); |
| |
| singa::SingleLabelImageRecord image; |
| for (int image_id = 0; image_id < 50000; image_id ++) { |
| // fill the record with the image feature and label from the downloaded binary files |
| string str; |
| image.SerializeToString(&str); |
| store.Write(to_string(image_id), str); |
| } |
| store.Flush(); |
| store.Close(); |
| |
| The data store for the test data is created similarly. |
| In addition, the program computes the average values of the image pixels (not shown here), |
| inserts the mean values into a SingleLabelImageRecord, and writes that record |
| into another store. |
| |
| 3. Compile and run the program. SINGA provides an example Makefile that contains instructions |
| for compiling the source code and linking it with *libsinga.so*. Users only need to run the following command. |
| |
| $ make create |
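
The mean computation mentioned in step 2 is not shown in this document. One plausible way to do it, sketched under the assumption that every image is held as a string of raw pixel bytes of equal length, is a per-position average. `ComputeMean` is a hypothetical helper, not the function used in create_data.cc.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: averages each pixel position over all images.
// Images whose size differs from image_bytes are skipped.
std::vector<float> ComputeMean(const std::vector<std::string>& images,
                               std::size_t image_bytes) {
  std::vector<double> sum(image_bytes, 0.0);
  std::size_t count = 0;
  for (const std::string& pixel : images) {
    if (pixel.size() != image_bytes) continue;
    for (std::size_t i = 0; i < image_bytes; ++i) {
      // Treat each byte as an unsigned intensity in [0, 255].
      sum[i] += static_cast<unsigned char>(pixel[i]);
    }
    ++count;
  }
  std::vector<float> mean(image_bytes, 0.0f);
  if (count == 0) return mean;
  for (std::size_t i = 0; i < image_bytes; ++i) {
    mean[i] = static_cast<float>(sum[i] / count);
  }
  return mean;
}
```

Because the result is a vector of floats, it would fit the repeated float `data` field of a SingleLabelImageRecord.
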
| |
| ## Using user-defined record format |
| |
| If users cannot use the SingleLabelImageRecord or CSV record for their data, |
| they can define their own record format, e.g., using Google Protobuf. |
| A record can be written into a data store as long as it can be converted |
| into a byte string. Correspondingly, subclasses of StoreInputLayer are required to |
| parse user-defined records. |
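
As a minimal sketch of the "convertible to a byte string" requirement, the code below packs a custom record into a `std::string` and parses it back. `MyRecord`, `SerializeRecord`, and `ParseRecord` are hypothetical names. The raw memory layout used here depends on the machine's byte order, so a Protobuf message is likely the more portable choice in practice.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical user-defined record: one label plus a feature vector.
struct MyRecord {
  std::int32_t label = 0;
  std::vector<float> features;
};

// Packs the label followed by the raw float values into a byte string.
std::string SerializeRecord(const MyRecord& record) {
  const std::size_t payload = record.features.size() * sizeof(float);
  std::string out(sizeof(record.label) + payload, '\0');
  std::memcpy(&out[0], &record.label, sizeof(record.label));
  if (payload > 0) {
    std::memcpy(&out[sizeof(record.label)], record.features.data(), payload);
  }
  return out;
}

// Reverses SerializeRecord; returns false if the byte string is malformed.
bool ParseRecord(const std::string& bytes, MyRecord* record) {
  if (bytes.size() < sizeof(record->label)) return false;
  const std::size_t payload = bytes.size() - sizeof(record->label);
  if (payload % sizeof(float) != 0) return false;
  std::memcpy(&record->label, bytes.data(), sizeof(record->label));
  record->features.resize(payload / sizeof(float));
  if (payload > 0) {
    std::memcpy(record->features.data(),
                bytes.data() + sizeof(record->label), payload);
  }
  return true;
}
```

The serialized string can be passed as the value of a tuple when writing to a store, and a custom input layer would apply the matching parse step to each value it reads.
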