# Data Preparation
---
SINGA uses input layers to load data.
Users can store their data in any format (e.g., CSV or binary) and in any location
(e.g., disk file or HDFS) as long as there are corresponding input layers that
can read the data records and parse them.
To make it easy for users, SINGA provides a [StoreInputLayer] to read data
in the format of (string:key, string:value) tuples from a couple of sources.
These sources are abstracted using a [Store]() class which is a simple version of
the DB abstraction in Caffe. The base Store class provides the following operations
for reading and writing tuples:

    Open(string path, Mode mode);  // open the store for kRead or kCreate or kAppend
    Close();
    Read(string* key, string* val);  // read a tuple; return false if fail
    Write(string key, string val);  // write a tuple
    Flush();

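To make the interface above concrete, here is a minimal in-memory sketch of a store with the same five operations. `MemoryStore` and the `Mode` enum shown here are hypothetical stand-ins written for illustration; this is not SINGA's implementation, and the real `Store` signatures and return types may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the modes named in the text.
enum Mode { kRead, kCreate, kAppend };

// Illustrative in-memory store; NOT SINGA's Store class.
class MemoryStore {
 public:
  // `path` is ignored here; a real store would open a file or an HDFS path.
  bool Open(const std::string& path, Mode mode) {
    (void)path;
    mode_ = mode;
    if (mode == kCreate) tuples_.clear();  // kCreate starts from an empty store
    cursor_ = 0;                           // reads start at the first tuple
    open_ = true;
    return true;
  }

  void Close() { open_ = false; }

  // Reads the next tuple; returns false if the store is not open for
  // reading or no tuple is left.
  bool Read(std::string* key, std::string* val) {
    assert(key != nullptr && val != nullptr);
    if (!open_ || mode_ != kRead || cursor_ >= tuples_.size()) return false;
    *key = tuples_[cursor_].first;
    *val = tuples_[cursor_].second;
    ++cursor_;
    return true;
  }

  // Appends a tuple; returns false if the store is not open for writing.
  bool Write(const std::string& key, const std::string& val) {
    if (!open_ || mode_ == kRead) return false;
    tuples_.emplace_back(key, val);
    return true;
  }

  void Flush() {}  // nothing is buffered in this sketch

 private:
  std::vector<std::pair<std::string, std::string>> tuples_;
  std::size_t cursor_ = 0;
  Mode mode_ = kRead;
  bool open_ = false;
};
```

A caller would open the store with `kCreate`, write tuples, flush and close it, then reopen it with `kRead` and call `Read` until it returns false.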
Currently, two implementations are provided:
1. [KVFileStore] for storing tuples in [KVFile]() (a binary file).
The *create_data.cc* files in *examples/cifar10* and *examples/mnist* provide
examples of storing records using KVFileStore.
2. [TextFileStore] for storing tuples in a plain text file (one line per tuple).

The (key, value) tuples are parsed by subclasses of StoreInputLayer, depending on the
format of the tuple:
* [ProtoRecordInputLayer] parses the value field from one
tuple into a [SingleLabelImageRecord], which is generated by Google Protobuf according
to [common.proto]. It can be used to store features for images (e.g., using the pixel field)
or other objects (using the data field). The key field is not used.
* [CSVRecordInputLayer] parses one tuple as a CSV line (fields separated by commas).
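As an illustration of the CSV case, the sketch below splits one comma-separated line into floating-point fields. `ParseCsvLine` is a hypothetical helper, not part of SINGA, and it assumes plain numeric fields without quoting; how CSVRecordInputLayer itself handles labels, quoting, or whitespace is not covered here.

```cpp
#include <cassert>
#include <cstdlib>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse a comma-separated line into floats.
// Returns false if any field is empty or is not a plain number.
bool ParseCsvLine(const std::string& line, std::vector<float>* out) {
  assert(out != nullptr);
  out->clear();
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) {
    char* end = nullptr;
    const float value = std::strtof(field.c_str(), &end);
    // Reject empty fields and trailing garbage such as "1.5x".
    if (end == field.c_str() || *end != '\0') return false;
    out->push_back(value);
  }
  return true;
}
```

Returning a status and writing through an output pointer mirrors the `Read(string* key, string* val)` style of the Store interface, so a malformed line can be skipped without exceptions.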
## Using built-in record format
SingleLabelImageRecord is a built-in record in SINGA for storing image features.
It is used in the cifar10 and mnist examples.

    message SingleLabelImageRecord {
      repeated int32 shape = 1;  // e.g., 3 (RGB channels), 32 (rows), 32 (cols)
      optional int32 label = 2;  // label
      optional bytes pixel = 3;  // pixels
      repeated float data = 4 [packed = true];  // it is used for normalization
    }

This section walks through data preparation for the [CIFAR-10 image dataset](http://www.cs.toronto.edu/~kriz/cifar.html).
The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
There are 50,000 training images and 10,000 test images.
Each image has a single label. The dataset is stored in binary files with a specific format.
SINGA provides [create_data.cc](https://github.com/apache/incubator-singa/blob/master/examples/cifar10/create_data.cc)
to convert the images in the binary files into `SingleLabelImageRecord`s and insert them into the training and test stores.
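For orientation, the CIFAR-10 binary files store each image as 3073 bytes: one label byte followed by 3072 pixel bytes (1024 red, then 1024 green, then 1024 blue, each in row-major order). The sketch below decodes one such record into a plain struct that mirrors the fields of SingleLabelImageRecord. `ImageRecord` and `DecodeCifarRecord` are hypothetical names used only for illustration; the actual converter fills the Protobuf-generated class.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

constexpr std::size_t kChannels = 3;
constexpr std::size_t kRows = 32;
constexpr std::size_t kCols = 32;
constexpr std::size_t kPixelBytes = kChannels * kRows * kCols;  // 3072
constexpr std::size_t kRecordBytes = 1 + kPixelBytes;           // 3073

// Hypothetical plain-struct mirror of SingleLabelImageRecord.
struct ImageRecord {
  std::vector<int> shape;  // channels, rows, cols
  int label = -1;
  std::string pixel;       // raw pixel bytes, channel-major
};

// Decodes the record that starts at byte `offset` of `buffer`.
ImageRecord DecodeCifarRecord(const std::string& buffer, std::size_t offset) {
  assert(offset + kRecordBytes <= buffer.size());
  ImageRecord rec;
  rec.shape = {static_cast<int>(kChannels), static_cast<int>(kRows),
               static_cast<int>(kCols)};
  // The first byte is the class label (0 to 9); read it as unsigned.
  rec.label = static_cast<unsigned char>(buffer[offset]);
  rec.pixel = buffer.substr(offset + 1, kPixelBytes);
  return rec;
}
```

With a whole batch file read into a string, record `i` starts at offset `i * kRecordBytes`.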
1. Download raw data. The following commands download the dataset into the *cifar-10-batches-bin* folder.

        # in SINGA_ROOT/examples/cifar10
        $ cp Makefile.example Makefile  # an example makefile is provided
        $ make download

2. Fill one record for each image, and insert it into the store.

        KVFileStore store;
        store.Open(output_file_path, singa::io::kCreate);
        singa::SingleLabelImageRecord image;
        for (int image_id = 0; image_id < 50000; image_id++) {
          // fill the record with the image feature and label from the downloaded binary files
          string str;
          image.SerializeToString(&str);
          store.Write(to_string(image_id), str);
        }
        store.Flush();
        store.Close();

    The data store for the test data is created similarly.
    In addition, the program computes the average values (not shown here) of the image pixels and
    inserts the mean values into a SingleLabelImageRecord, which is then written
    into another store.

3. Compile and run the program. SINGA provides an example Makefile that contains instructions
for compiling the source code and linking it with *libsinga.so*. Run the following command:

        $ make create

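Step 2 mentions that the average pixel values are computed but does not show how. A minimal sketch follows, assuming the mean is taken per pixel position across all images and stored as floats (matching the `data` field of the record). `ComputeMeanImage` is a hypothetical helper, not the code in create_data.cc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: per-position mean over equally sized pixel buffers.
std::vector<float> ComputeMeanImage(const std::vector<std::string>& images) {
  assert(!images.empty());
  const std::size_t n = images[0].size();
  std::vector<float> mean(n, 0.0f);
  for (const std::string& img : images) {
    assert(img.size() == n);  // all images must have the same size
    for (std::size_t i = 0; i < n; ++i) {
      // Pixels are intensities in [0, 255]; plain char may be signed.
      mean[i] += static_cast<unsigned char>(img[i]);
    }
  }
  for (float& m : mean) m /= static_cast<float>(images.size());
  return mean;
}
```

The resulting vector could then be copied into the `data` field of a record and written to its own store, as the text describes.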
## Using user-defined record format
If users cannot use the SingleLabelImageRecord or the CSV record for their data,
they can define their own record format, e.g., using Google Protobuf.
A record can be written into a data store as long as it can be converted
into a byte string. Correspondingly, subclasses of StoreInputLayer are required to
parse user-defined records.
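The round trip described above (record to byte string, byte string back to record) can be sketched without Protobuf. `FeatureRecord`, `Serialize`, and `Parse` are hypothetical names, and the byte layout is an assumption made for this example; it uses host byte order, so unlike Protobuf it is not portable across architectures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical user-defined record: one label plus a feature vector.
struct FeatureRecord {
  std::int32_t label = 0;
  std::vector<float> features;
};

// Layout: [int32 label][uint32 count][count floats], host byte order.
std::string Serialize(const FeatureRecord& rec) {
  const std::uint32_t count = static_cast<std::uint32_t>(rec.features.size());
  std::string out(sizeof(rec.label) + sizeof(count) + count * sizeof(float), '\0');
  char* p = &out[0];
  std::memcpy(p, &rec.label, sizeof(rec.label));
  p += sizeof(rec.label);
  std::memcpy(p, &count, sizeof(count));
  p += sizeof(count);
  if (count > 0) std::memcpy(p, rec.features.data(), count * sizeof(float));
  return out;
}

// Parses bytes produced by Serialize; returns false if the size is inconsistent.
bool Parse(const std::string& bytes, FeatureRecord* rec) {
  assert(rec != nullptr);
  std::uint32_t count = 0;
  const std::size_t header = sizeof(rec->label) + sizeof(count);
  if (bytes.size() < header) return false;
  std::memcpy(&rec->label, bytes.data(), sizeof(rec->label));
  std::memcpy(&count, bytes.data() + sizeof(rec->label), sizeof(count));
  if (bytes.size() != header + count * sizeof(float)) return false;
  rec->features.resize(count);
  if (count > 0) {
    std::memcpy(rec->features.data(), bytes.data() + header, count * sizeof(float));
  }
  return true;
}
```

The string returned by `Serialize` is what would be passed as the value to a store's `Write`, and a custom StoreInputLayer subclass would perform the equivalent of `Parse` on each value it reads.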