<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>CheckPoint &mdash; incubator-singa 0.3.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="top" title="incubator-singa 0.3.0 documentation" href="../index.html"/>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> incubator-singa
<img src="../_static/singa.png" class="logo" />
</a>
<div class="version">
0.3.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../downloads.html">Download SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="index.html">Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../develop/schedule.html">Development Schedule</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/how-contribute.html">How to Contribute to SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-code.html">How to Contribute Code</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-docs.html">How to Contribute Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Community</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../community/source-repository.html">Source Repository</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/mail-lists.html">Project Mailing Lists</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/issue-tracking.html">Issue Tracking</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/team-list.html">The SINGA Team</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">incubator-singa</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>CheckPoint</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="checkpoint">
<span id="checkpoint"></span><h1>CheckPoint<a class="headerlink" href="#checkpoint" title="Permalink to this headline"></a></h1>
<hr class="docutils" />
<p>SINGA periodically checkpoints model parameters to disk at a user-configured
frequency. The checkpointed parameters can be used to</p>
<ol class="simple">
<li>resume training from the last checkpoint. For example, if
the program crashes before finishing all training steps, training can continue
from the checkpoint files.</li>
<li>initialize a similar model. For example, the
parameters from training an RBM model can be used to initialize
a <a class="reference external" href="rbm.html">deep auto-encoder</a> model.</li>
</ol>
<div class="section" id="configuration">
<span id="configuration"></span><h2>Configuration<a class="headerlink" href="#configuration" title="Permalink to this headline"></a></h2>
<p>Checkpointing is controlled by two configuration fields:</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">checkpoint_after</span></code>: the number of training steps to complete before checkpointing starts,</li>
<li><code class="docutils literal"><span class="pre">checkpoint_freq</span></code>: the number of training steps between consecutive checkpoints.</li>
</ul>
<p>For example,</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># job.conf</span>
<span class="n">checkpoint_after</span><span class="p">:</span> <span class="mi">100</span>
<span class="n">checkpoint_freq</span><span class="p">:</span> <span class="mi">300</span>
<span class="o">...</span>
</pre></div>
</div>
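<p>Checkpoint files are written to <em>WORKSPACE/checkpoint/stepSTEP-workerWORKERID</em>,
where <em>STEP</em> is the training step and <em>WORKERID</em> is the ID of the worker that wrote the file.
<em>WORKSPACE</em> is set in the cluster configuration:</p>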
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cluster</span> <span class="p">{</span>
<span class="n">workspace</span><span class="p">:</span>
<span class="p">}</span>
</pre></div>
</div>
<p>With the above configuration, training for 700 steps produces
two checkpoint files:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">step400</span><span class="o">-</span><span class="n">worker0</span>
<span class="n">step700</span><span class="o">-</span><span class="n">worker0</span>
</pre></div>
</div>
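<p>The schedule implied by this example can be sketched in a few lines of Python.
This is only an illustration, not SINGA code: the helper names
<code class="docutils literal"><span class="pre">checkpoint_steps</span></code> and
<code class="docutils literal"><span class="pre">checkpoint_files</span></code> are hypothetical,
and the sketch assumes that the first checkpoint is written one full checkpointing interval after
<code class="docutils literal"><span class="pre">checkpoint_after</span></code>,
which is what the two files listed above suggest.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>def checkpoint_steps(checkpoint_after, checkpoint_freq, total_steps):
    """Return the steps at which a checkpoint is assumed to be written.

    Assumption (inferred from the example, not from SINGA's source): the
    first checkpoint is written at checkpoint_after + checkpoint_freq, and
    one more follows every checkpoint_freq steps. checkpoint_freq must be
    a positive integer.
    """
    first = checkpoint_after + checkpoint_freq
    return list(range(first, total_steps + 1, checkpoint_freq))


def checkpoint_files(steps, worker_id=0):
    """Build file names following the stepSTEP-workerWORKERID pattern."""
    return ["step%d-worker%d" % (step, worker_id) for step in steps]


steps = checkpoint_steps(checkpoint_after=100, checkpoint_freq=300, total_steps=700)
print(checkpoint_files(steps))  # prints ['step400-worker0', 'step700-worker0']
</pre></div>
</div>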
</div>
<div class="section" id="application-resuming-training">
<span id="application-resuming-training"></span><h2>Application - resuming training<a class="headerlink" href="#application-resuming-training" title="Permalink to this headline"></a></h2>
<p>To resume training from the last checkpoint (i.e., step 700), rerun the job with the <code class="docutils literal"><span class="pre">-resume</span></code> flag:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="o">./</span><span class="nb">bin</span><span class="o">/</span><span class="n">singa</span><span class="o">-</span><span class="n">run</span><span class="o">.</span><span class="n">sh</span> <span class="o">-</span><span class="n">conf</span> <span class="n">JOB_CONF</span> <span class="o">-</span><span class="n">resume</span>
</pre></div>
</div>
<p>The job configuration does not need to change.</p>
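<p>Before resuming, it can be useful to check which step the last checkpoint corresponds to.
The following Python sketch does this by parsing checkpoint file names. It is not part of SINGA:
the helper name <code class="docutils literal"><span class="pre">latest_step</span></code> is hypothetical,
and the only assumption is the <em>stepSTEP-workerWORKERID</em> naming pattern described above.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>import re


def latest_step(filenames):
    """Return the highest step among checkpoint file names, or None."""
    pattern = re.compile(r"^step(\d+)-worker\d+$")
    matches = [pattern.match(name) for name in filenames]
    # Names that do not follow the pattern are ignored.
    steps = [int(m.group(1)) for m in matches if m]
    return max(steps) if steps else None


print(latest_step(["step400-worker0", "step700-worker0"]))  # prints 700
</pre></div>
</div>
<p>In practice the list of names could come from <code class="docutils literal"><span class="pre">os.listdir</span></code> applied to the <em>WORKSPACE/checkpoint</em> folder.</p>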
</div>
<div class="section" id="application-model-initialization">
<span id="application-model-initialization"></span><h2>Application - model initialization<a class="headerlink" href="#application-model-initialization" title="Permalink to this headline"></a></h2>
<p>A checkpoint file, such as the one from step 400, can also be used to initialize
a new model. Configure the new job as follows:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># job.conf</span>
<span class="n">checkpoint</span> <span class="p">:</span> <span class="s2">&quot;WORKSPACE/checkpoint/step400-worker0&quot;</span>
<span class="o">...</span>
</pre></div>
</div>
<p>If model partitioning produces multiple checkpoint files for the same snapshot,
list all of them:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="c1"># job.conf</span>
<span class="n">checkpoint</span> <span class="p">:</span> <span class="s2">&quot;WORKSPACE/checkpoint/step400-worker0&quot;</span>
<span class="n">checkpoint</span> <span class="p">:</span> <span class="s2">&quot;WORKSPACE/checkpoint/step400-worker1&quot;</span>
<span class="o">...</span>
</pre></div>
</div>
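<p>When a snapshot is split across many workers, the configuration lines can be generated rather than typed by hand.
The Python sketch below is a hypothetical helper, not a SINGA tool. It assumes that worker IDs run from 0 to
<code class="docutils literal"><span class="pre">num_workers</span> <span class="pre">-</span> <span class="pre">1</span></code>
and reuses the <code class="docutils literal"><span class="pre">checkpoint</span></code> field exactly as it is written in the example above.</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>def checkpoint_lines(workspace, step, num_workers):
    """Return one job.conf checkpoint line per worker for a given step."""
    template = 'checkpoint : "%s/checkpoint/step%d-worker%d"'
    return [template % (workspace, step, worker) for worker in range(num_workers)]


# Prints the two lines used in the example above.
print("\n".join(checkpoint_lines("WORKSPACE", 400, 2)))
</pre></div>
</div>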
<p>The training command is the same as for starting a new job:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="o">./</span><span class="nb">bin</span><span class="o">/</span><span class="n">singa</span><span class="o">-</span><span class="n">run</span><span class="o">.</span><span class="n">sh</span> <span class="o">-</span><span class="n">conf</span> <span class="n">JOB_CONF</span>
</pre></div>
</div>
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016 The Apache Software Foundation. All rights reserved. Apache Singa, Apache, the Apache feather logo, and the Apache Singa project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'0.3.0',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.StickyNav.enable();
});
</script>
<div class="rst-versions shift-up" data-toggle="rst-versions" role="note" aria-label="versions">
<img src="../_static/apache.jpg">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> incubator-singa </span>
v: 0.3.0
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
<dl>
<dt>Languages</dt>
<dd><a href="../../en/index.html">English</a></dd>
<dd><a href="../../zh/index.html">中文</a></dd>
<dd><a href="../../jp/index.html">日本語</a></dd>
<dd><a href="../../kr/index.html">한국어</a></dd>
</dl>
</div>
</div>
<a href="https://github.com/apache/incubator-singa">
<img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
src="https://s3.amazonaws.com/github/ribbons/forkme_right_orange_ff7600.png"
alt="Fork me on GitHub">
</a>
</body>
</html>