<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Data Preparation &mdash; incubator-singa 0.3.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="top" title="incubator-singa 0.3.0 documentation" href="../index.html"/>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> incubator-singa
<img src="../_static/singa.png" class="logo" />
</a>
<div class="version">
0.3.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../downloads.html">Download SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="index.html">Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../develop/schedule.html">Development Schedule</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/how-contribute.html">How to Contribute to SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-code.html">How to Contribute Code</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-docs.html">How to Contribute Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Community</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../community/source-repository.html">Source Repository</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/mail-lists.html">Project Mailing Lists</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/issue-tracking.html">Issue Tracking</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/team-list.html">The SINGA Team</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">incubator-singa</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>Data Preparation</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="data-preparation">
<span id="data-preparation"></span><h1>Data Preparation<a class="headerlink" href="#data-preparation" title="Permalink to this headline"></a></h1>
<hr class="docutils" />
<p>SINGA uses input layers to load data.
Users can store their data in any format (e.g., CSV or binary) and in any location
(e.g., a disk file or HDFS), as long as there is a corresponding input layer that
can read and parse the data records.</p>
<p>For convenience, SINGA provides a StoreInputLayer that reads data
as (string:key, string:value) tuples from a couple of sources.
These sources are abstracted by a <a class="reference external" href="#">Store</a> class, which is a simplified version of
the DB abstraction in Caffe. The base Store class provides the following operations
for reading and writing tuples:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">Open</span><span class="p">(</span><span class="n">string</span> <span class="n">path</span><span class="p">,</span> <span class="n">Mode</span> <span class="n">mode</span><span class="p">);</span> <span class="o">//</span> <span class="nb">open</span> <span class="n">the</span> <span class="n">store</span> <span class="k">for</span> <span class="n">kRead</span> <span class="ow">or</span> <span class="n">kCreate</span> <span class="ow">or</span> <span class="n">kAppend</span>
<span class="n">Close</span><span class="p">();</span>
<span class="n">Read</span><span class="p">(</span><span class="n">string</span><span class="o">*</span> <span class="n">key</span><span class="p">,</span> <span class="n">string</span><span class="o">*</span> <span class="n">val</span><span class="p">);</span> <span class="o">//</span> <span class="n">read</span> <span class="n">a</span> <span class="nb">tuple</span><span class="p">;</span> <span class="k">return</span> <span class="n">false</span> <span class="k">if</span> <span class="n">fail</span>
<span class="n">Write</span><span class="p">(</span><span class="n">string</span> <span class="n">key</span><span class="p">,</span> <span class="n">string</span> <span class="n">val</span><span class="p">);</span> <span class="o">//</span> <span class="n">write</span> <span class="n">a</span> <span class="nb">tuple</span>
<span class="n">Flush</span><span class="p">();</span>
</pre></div>
</div>
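<p>The operations listed above can be sketched with a toy in-memory backend. This is only an illustration under stated assumptions: the <code class="docutils literal"><span class="pre">Store</span></code> interface below is a hypothetical stand-in written for this example (including the <code class="docutils literal"><span class="pre">bool</span></code> return types and the <code class="docutils literal"><span class="pre">MemoryStore</span></code> class), not SINGA's actual declaration.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not SINGA's actual declarations.
enum Mode { kCreate, kRead, kAppend };

class Store {
 public:
  virtual ~Store() = default;
  virtual bool Open(const std::string& path, Mode mode) = 0;
  virtual void Close() = 0;
  virtual bool Read(std::string* key, std::string* val) = 0;
  virtual bool Write(const std::string& key, const std::string& val) = 0;
  virtual void Flush() = 0;
};

// Toy backend that keeps tuples in memory instead of a file.
class MemoryStore : public Store {
 public:
  bool Open(const std::string& /*path*/, Mode mode) override {
    mode_ = mode;
    if (mode == kCreate) tuples_.clear();  // kCreate starts from an empty store
    cursor_ = 0;                           // reads start from the first tuple
    open_ = true;
    return true;
  }
  void Close() override { open_ = false; }
  bool Read(std::string* key, std::string* val) override {
    assert(key != nullptr && val != nullptr);
    if (!open_ || mode_ != kRead || cursor_ >= tuples_.size()) return false;
    *key = tuples_[cursor_].first;
    *val = tuples_[cursor_].second;
    ++cursor_;
    return true;
  }
  bool Write(const std::string& key, const std::string& val) override {
    if (!open_ || mode_ == kRead) return false;  // writing needs kCreate or kAppend
    tuples_.emplace_back(key, val);
    return true;
  }
  void Flush() override {}  // nothing is buffered in this toy backend

 private:
  std::vector<std::pair<std::string, std::string>> tuples_;
  std::size_t cursor_ = 0;
  Mode mode_ = kRead;
  bool open_ = false;
};
```

<p>A caller would open the store with <code class="docutils literal"><span class="pre">kCreate</span></code>, write tuples, flush and close it, and then reopen it with <code class="docutils literal"><span class="pre">kRead</span></code> and call <code class="docutils literal"><span class="pre">Read</span></code> until it returns false.</p>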
<p>Currently, two implementations are provided:</p>
<ol class="simple">
<li>KVFileStore, which stores tuples in a <a class="reference external" href="#">KVFile</a> (a binary file).
The <em>create_data.cc</em> files in <em>examples/cifar10</em> and <em>examples/mnist</em> provide
examples of storing records using KVFileStore.</li>
<li>TextFileStore, which stores tuples in a plain text file (one line per tuple).</li>
</ol>
<p>The (key, value) tuples are parsed by subclasses of StoreInputLayer, depending on the
format of the tuple:</p>
<ul class="simple">
<li>ProtoRecordInputLayer parses the value field of each
tuple into a SingleLabelImageRecord, a class generated by Google Protobuf from
common.proto. The record can store features for images (e.g., using the pixel field)
or other objects (using the data field). The key field is not used.</li>
<li>CSVRecordInputLayer parses each tuple as a CSV line (fields separated by commas).</li>
</ul>
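<p>To illustrate what parsing a CSV tuple involves, here is a minimal sketch. It assumes that the first field is an integer label and the remaining fields are float features, with no quoting or escaped commas. The function name <code class="docutils literal"><span class="pre">ParseCsvRecord</span></code> and this field layout are assumptions made for the example, not a description of CSVRecordInputLayer's actual code.</p>

```cpp
#include <cassert>
#include <exception>
#include <sstream>
#include <string>
#include <vector>

// Parses "label,f1,f2,..." into a label and a feature vector.
// The label-first layout is an assumption made for this sketch.
bool ParseCsvRecord(const std::string& line, int* label,
                    std::vector<float>* features) {
  assert(label != nullptr && features != nullptr);
  features->clear();
  std::istringstream fields(line);
  std::string field;
  try {
    if (!std::getline(fields, field, ',')) return false;  // empty line
    *label = std::stoi(field);
    while (std::getline(fields, field, ',')) {
      features->push_back(std::stof(field));
    }
  } catch (const std::exception&) {
    return false;  // a field was not numeric or was out of range
  }
  return true;
}
```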
<div class="section" id="using-built-in-record-format">
<span id="using-built-in-record-format"></span><h2>Using built-in record format<a class="headerlink" href="#using-built-in-record-format" title="Permalink to this headline"></a></h2>
<p>SingleLabelImageRecord is a built-in record in SINGA for storing image features.
It is used in the cifar10 and mnist examples.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">message</span> <span class="n">SingleLabelImageRecord</span> <span class="p">{</span>
  <span class="n">repeated</span> <span class="n">int32</span> <span class="n">shape</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span> <span class="o">//</span> e.g., 3 (RGB channels), 32 (rows), 32 (columns)
  <span class="n">optional</span> <span class="n">int32</span> <span class="n">label</span> <span class="o">=</span> <span class="mi">2</span><span class="p">;</span> <span class="o">//</span> <span class="n">label</span>
  <span class="n">optional</span> <span class="nb">bytes</span> <span class="n">pixel</span> <span class="o">=</span> <span class="mi">3</span><span class="p">;</span> <span class="o">//</span> <span class="n">pixels</span>
  <span class="n">repeated</span> <span class="nb">float</span> <span class="n">data</span> <span class="o">=</span> <span class="mi">4</span> <span class="p">[</span><span class="n">packed</span> <span class="o">=</span> <span class="n">true</span><span class="p">];</span> <span class="o">//</span> <span class="n">it</span> <span class="ow">is</span> <span class="n">used</span> <span class="k">for</span> <span class="n">normalization</span>
<span class="p">}</span>
</pre></div>
</div>
<p>This section describes data preparation for the <a class="reference external" href="http://www.cs.toronto.edu/~kriz/cifar.html">CIFAR-10 image dataset</a>.
The dataset consists of 60,000 32x32 color images in 10 classes, with 6,000 images per class.
There are 50,000 training images and 10,000 test images.
Each image has a single label. The dataset is stored in binary files with a specific format.
SINGA provides <a class="reference external" href="https://github.com/apache/incubator-singa/blob/master/examples/cifar10/create_data.cc">create_data.cc</a>
to convert the images in the binary files into <code class="docutils literal"><span class="pre">SingleLabelImageRecord</span></code> objects and insert them into the training and test stores.</p>
<ol>
<li><p class="first">Download the raw data. The following commands download the dataset into the <em>cifar-10-batches-bin</em> folder.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> # in SINGA_ROOT/examples/cifar10
$ cp Makefile.example Makefile  # an example makefile is provided
$ make download
</pre></div>
</div>
</li>
<li><p class="first">Fill one record for each image and insert it into the store.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">KVFileStore</span> <span class="n">store</span><span class="p">;</span>
<span class="n">store</span><span class="o">.</span><span class="n">Open</span><span class="p">(</span><span class="n">output_file_path</span><span class="p">,</span> <span class="n">singa</span><span class="p">::</span><span class="n">io</span><span class="p">::</span><span class="n">kCreate</span><span class="p">);</span>
<span class="n">singa</span><span class="p">::</span><span class="n">SingleLabelImageRecord</span> <span class="n">image</span><span class="p">;</span>
<span class="k">for</span> <span class="p">(</span><span class="nb">int</span> <span class="n">image_id</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">image_id</span> <span class="o">&lt;</span> <span class="mi">50000</span><span class="p">;</span> <span class="n">image_id</span> <span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="o">//</span> <span class="n">fill</span> <span class="n">the</span> <span class="n">record</span> <span class="k">with</span> <span class="n">image</span> <span class="n">feature</span> <span class="ow">and</span> <span class="n">label</span> <span class="kn">from</span> <span class="nn">downloaded</span> <span class="n">binary</span> <span class="n">files</span>
<span class="n">string</span> <span class="nb">str</span><span class="p">;</span>
<span class="n">image</span><span class="o">.</span><span class="n">SerializeToString</span><span class="p">(</span><span class="o">&amp;</span><span class="nb">str</span><span class="p">);</span>
<span class="n">store</span><span class="o">.</span><span class="n">Write</span><span class="p">(</span><span class="n">to_string</span><span class="p">(</span><span class="n">image_id</span><span class="p">),</span> <span class="nb">str</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">store</span><span class="o">.</span><span class="n">Flush</span><span class="p">();</span>
<span class="n">store</span><span class="o">.</span><span class="n">Close</span><span class="p">();</span>
</pre></div>
</div>
<p>The data store for the test data is created similarly.
In addition, the program computes the average values of the image pixels (not shown here),
inserts these mean values into a SingleLabelImageRecord, and writes that record
into a separate store.</p>
</li>
<li><p class="first">Compile and run the program. SINGA provides an example Makefile that contains instructions
for compiling the source code and linking it with <em>libsinga.so</em>. Users only need to run the following command.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> $ make create
</pre></div>
</div>
</li>
</ol>
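<p>The mean computation mentioned in step 2 is not shown in the snippet above. A minimal sketch of one way to do it follows, assuming each image is available as a raw byte buffer of the same length (as in the pixel field) and that the mean is taken per pixel position. The helper name <code class="docutils literal"><span class="pre">MeanImage</span></code> is hypothetical, and the actual code in create_data.cc may differ.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Computes the per-position mean over equally sized pixel buffers,
// where each buffer holds one image as raw bytes.
std::vector<float> MeanImage(const std::vector<std::string>& images) {
  if (images.empty()) return {};
  const std::size_t size = images[0].size();
  std::vector<double> sum(size, 0.0);
  for (const std::string& pixels : images) {
    assert(pixels.size() == size);  // all images must share one shape
    for (std::size_t i = 0; i < size; ++i) {
      // Cast through unsigned char so bytes >= 128 are not read as negative.
      sum[i] += static_cast<unsigned char>(pixels[i]);
    }
  }
  std::vector<float> mean(size);
  for (std::size_t i = 0; i < size; ++i) {
    mean[i] = static_cast<float>(sum[i] / images.size());
  }
  return mean;
}
```

<p>The resulting floats could then be appended to the data field of a record, since that field is declared as repeated float.</p>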
</div>
<div class="section" id="using-user-defined-record-format">
<span id="using-user-defined-record-format"></span><h2>Using user-defined record format<a class="headerlink" href="#using-user-defined-record-format" title="Permalink to this headline"></a></h2>
<p>If users cannot use SingleLabelImageRecord or the CSV record for their data,
they can define their own record format, e.g., using Google Protobuf.
A record can be written into a data store as long as it can be converted
into a byte string. Correspondingly, a subclass of StoreInputLayer is required to
parse the user-defined records.</p>
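<p>As a rough sketch of the byte-string requirement, the hand-written record below is converted to and from a <code class="docutils literal"><span class="pre">std::string</span></code>. The type <code class="docutils literal"><span class="pre">MyRecord</span></code>, its fields, and the byte layout are all hypothetical choices for this example. In practice a Protobuf message would provide equivalent serialization methods, and the parsing step would live in a subclass of StoreInputLayer.</p>

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Hypothetical user-defined record used only for this illustration.
struct MyRecord {
  std::int32_t label = 0;
  std::vector<float> features;
};

// Layout: 4-byte label followed by raw floats, in host byte order.
// This is not portable across machines with different endianness.
std::string Serialize(const MyRecord& rec) {
  std::string out(sizeof(rec.label) + rec.features.size() * sizeof(float), '\0');
  std::memcpy(&out[0], &rec.label, sizeof(rec.label));
  if (!rec.features.empty()) {
    std::memcpy(&out[sizeof(rec.label)], rec.features.data(),
                rec.features.size() * sizeof(float));
  }
  return out;
}

// Returns false if the byte string cannot hold a label plus whole floats.
bool Parse(const std::string& bytes, MyRecord* rec) {
  assert(rec != nullptr);
  if (bytes.size() < sizeof(rec->label)) return false;
  const std::size_t payload = bytes.size() - sizeof(rec->label);
  if (payload % sizeof(float) != 0) return false;
  std::memcpy(&rec->label, bytes.data(), sizeof(rec->label));
  rec->features.resize(payload / sizeof(float));
  if (payload > 0) {
    std::memcpy(rec->features.data(), bytes.data() + sizeof(rec->label), payload);
  }
  return true;
}
```

<p>The string returned by <code class="docutils literal"><span class="pre">Serialize</span></code> can be passed as the value of a tuple when writing to a store, and the matching input layer would call the parsing function on each value it reads.</p>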
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016 The Apache Software Foundation. All rights reserved. Apache Singa, Apache, the Apache feather logo, and the Apache Singa project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'0.3.0',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.StickyNav.enable();
});
</script>
<div class="rst-versions shift-up" data-toggle="rst-versions" role="note" aria-label="versions">
<img src="../_static/apache.jpg">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> incubator-singa </span>
v: 0.3.0
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
<dl>
<dt>Languages</dt>
<dd><a href="../../en/index.html">English</a></dd>
<dd><a href="../../zh/index.html">中文</a></dd>
<dd><a href="../../jp/index.html">日本語</a></dd>
<dd><a href="../../kr/index.html">한국어</a></dd>
</dl>
</div>
</div>
<a href="https://github.com/apache/incubator-singa">
<img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
src="https://s3.amazonaws.com/github/ribbons/forkme_right_orange_ff7600.png"
alt="Fork me on GitHub">
</a>
</body>
</html>