<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Distributed Training Framework &mdash; incubator-singa 0.3.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="top" title="incubator-singa 0.3.0 documentation" href="../index.html"/>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> incubator-singa
<img src="../_static/singa.png" class="logo" />
</a>
<div class="version">
0.3.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../downloads.html">Download SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="index.html">Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../develop/schedule.html">Development Schedule</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/how-contribute.html">How to Contribute to SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-code.html">How to Contribute Code</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-docs.html">How to Contribute Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Community</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../community/source-repository.html">Source Repository</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/mail-lists.html">Project Mailing Lists</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/issue-tracking.html">Issue Tracking</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/team-list.html">The SINGA Team</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">incubator-singa</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="navigation" aria-label="breadcrumbs navigation">
<ul class="wy-breadcrumbs">
<li><a href="../index.html">Docs</a> &raquo;</li>
<li>Distributed Training Framework</li>
<li class="wy-breadcrumbs-aside">
</li>
</ul>
<hr/>
</div>
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<div class="section" id="distributed-training-framework">
<span id="distributed-training-framework"></span><h1>Distributed Training Framework<a class="headerlink" href="#distributed-training-framework" title="Permalink to this headline"></a></h1>
<hr class="docutils" />
<div class="section" id="cluster-topology-configuration">
<span id="cluster-topology-configuration"></span><h2>Cluster Topology Configuration<a class="headerlink" href="#cluster-topology-configuration" title="Permalink to this headline"></a></h2>
<p>Here we describe how to configure SINGA&#8217;s cluster topology to support
different distributed training frameworks.
The cluster topology is configured in the <code class="docutils literal"><span class="pre">cluster</span></code> field in <code class="docutils literal"><span class="pre">JobProto</span></code>.
The <code class="docutils literal"><span class="pre">cluster</span></code> is of type <code class="docutils literal"><span class="pre">ClusterProto</span></code>:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span>message ClusterProto {
optional int32 nworker_groups = 1;
optional int32 nserver_groups = 2;
optional int32 nworkers_per_group = 3 [default = 1];
optional int32 nservers_per_group = 4 [default = 1];
optional int32 nworkers_per_procs = 5 [default = 1];
optional int32 nservers_per_procs = 6 [default = 1];
// servers and workers in different processes?
optional bool server_worker_separate = 20 [default = false];
......
}
</pre></div>
</div>
<p>The most commonly used fields are as follows:</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">nworkers_per_group</span></code> and <code class="docutils literal"><span class="pre">nworkers_per_procs</span></code>:
determine how the worker-side ParamShard is partitioned.</li>
<li><code class="docutils literal"><span class="pre">nservers_per_group</span></code> and <code class="docutils literal"><span class="pre">nservers_per_procs</span></code>:
determine how the server-side ParamShard is partitioned.</li>
<li><code class="docutils literal"><span class="pre">server_worker_separate</span></code>:
whether servers and workers run in separate processes.</li>
</ul>
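<p>As a rough illustration of how these fields combine, the sketch below mirrors the listed <code class="docutils literal"><span class="pre">ClusterProto</span></code> fields in a small Python class and derives the total number of workers and servers. The class name <code class="docutils literal"><span class="pre">ClusterConfig</span></code> and its two helper methods are hypothetical and are not part of SINGA; only the field names and defaults are taken from the listing above.</p>

```python
from dataclasses import dataclass


@dataclass
class ClusterConfig:
    """Hypothetical Python mirror of the ClusterProto fields listed above."""
    nworker_groups: int
    nserver_groups: int
    nworkers_per_group: int = 1   # defaults copied from the proto listing
    nservers_per_group: int = 1
    nworkers_per_procs: int = 1
    nservers_per_procs: int = 1
    server_worker_separate: bool = False

    def total_workers(self):
        # Every worker group holds the same number of workers.
        return self.nworker_groups * self.nworkers_per_group

    def total_servers(self):
        # Every server group holds the same number of servers.
        return self.nserver_groups * self.nservers_per_group


cfg = ClusterConfig(nworker_groups=2, nserver_groups=1,
                    nworkers_per_group=2, nservers_per_group=2,
                    server_worker_separate=True)
print(cfg.total_workers(), cfg.total_servers())
```

<p>Because the per-group and per-process fields default to 1, only the two group counts have to be given explicitly in this sketch.</p>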
</div>
<div class="section" id="different-training-frameworks">
<span id="different-training-frameworks"></span><h2>Different Training Frameworks<a class="headerlink" href="#different-training-frameworks" title="Permalink to this headline"></a></h2>
<p>In SINGA, worker groups run asynchronously and
workers within one group run synchronously.
Users can leverage this general design to run
both <strong>synchronous</strong> and <strong>asynchronous</strong> training frameworks.
Here we illustrate how to configure
popular distributed training frameworks in SINGA.</p>
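<p>The rule in the paragraph above can be sketched as a one-line classification: a single worker group means all workers run synchronously, while two or more groups run asynchronously with respect to each other. The function name <code class="docutils literal"><span class="pre">training_mode</span></code> is hypothetical; the <code class="docutils literal"><span class="pre">nworker_groups</span></code> values are the ones used in the four configurations on this page.</p>

```python
def training_mode(nworker_groups):
    """Classify a topology by its number of worker groups.

    Workers inside one group are synchronous; separate groups are
    asynchronous with respect to each other.
    """
    if nworker_groups >= 2:
        return "asynchronous"
    return "synchronous"


# nworker_groups as set in each configuration on this page.
examples = {
    "Sandblaster": 1,
    "AllReduce": 1,
    "Downpour": 2,
    "Distributed Hogwild": 3,
}

for name, groups in examples.items():
    print(name, training_mode(groups))
```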
<p><img src="../_static/images/frameworks.png" style="width: 800px"/></p>
<p><strong>Fig.1 - Training frameworks in SINGA</strong></p><h3>Sandblaster</h3>
<p>This is a <strong>synchronous</strong> framework used by Google Brain.
Fig.1(a) shows the Sandblaster framework implemented in SINGA.
Its configuration is as follows:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cluster</span> <span class="p">{</span>
<span class="n">nworker_groups</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nserver_groups</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nworkers_per_group</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">nservers_per_group</span><span class="p">:</span> <span class="mi">2</span>
<span class="n">server_worker_separate</span><span class="p">:</span> <span class="n">true</span>
<span class="p">}</span>
</pre></div>
</div>
<p>A single server group is launched to handle all requests from workers.
A worker computes on its partition of the model,
and only communicates with servers handling related parameters.</p>
<h3>AllReduce</h3>
<p>This is a <strong>synchronous</strong> framework used by Baidu&#8217;s DeepImage.
Fig.1(b) shows the AllReduce framework implemented in SINGA.
Its configuration is as follows:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cluster</span> <span class="p">{</span>
<span class="n">nworker_groups</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nserver_groups</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nworkers_per_group</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">nservers_per_group</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">server_worker_separate</span><span class="p">:</span> <span class="n">false</span>
<span class="p">}</span>
</pre></div>
</div>
<p>We bind each worker with a server on the same node, so that each
node is responsible for maintaining a partition of parameters and
collecting updates from all other nodes.</p>
<h3>Downpour</h3>
<p>This is an <strong>asynchronous</strong> framework used by Google Brain.
Fig.1(c) shows the Downpour framework implemented in SINGA.
Its configuration is as follows:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cluster</span> <span class="p">{</span>
<span class="n">nworker_groups</span><span class="p">:</span> <span class="mi">2</span>
<span class="n">nserver_groups</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nworkers_per_group</span><span class="p">:</span> <span class="mi">2</span>
<span class="n">nservers_per_group</span><span class="p">:</span> <span class="mi">2</span>
<span class="n">server_worker_separate</span><span class="p">:</span> <span class="n">true</span>
<span class="p">}</span>
</pre></div>
</div>
<p>Similar to the synchronous Sandblaster, all workers send
requests to a global server group. We divide workers into several
worker groups, each running independently and working on parameters
from the last <em>update</em> response.</p>
<h3>Distributed Hogwild</h3>
<p>This is an <strong>asynchronous</strong> framework used by Caffe.
Fig.1(d) shows the Distributed Hogwild framework implemented in SINGA.
Its configuration is as follows:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">cluster</span> <span class="p">{</span>
<span class="n">nworker_groups</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">nserver_groups</span><span class="p">:</span> <span class="mi">3</span>
<span class="n">nworkers_per_group</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">nservers_per_group</span><span class="p">:</span> <span class="mi">1</span>
<span class="n">server_worker_separate</span><span class="p">:</span> <span class="n">false</span>
<span class="p">}</span>
</pre></div>
</div>
<p>Each node contains a complete server group and a complete worker group.
Parameter updates are done locally, so that communication cost
during each training step is minimized.
However, each server group must periodically synchronize with
neighboring groups to improve training convergence.</p>
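<p>To compare the four configurations on this page side by side, the sketch below estimates how many processes each one would launch. It rests on an assumption that this page does not state: each process hosts up to <code class="docutils literal"><span class="pre">nworkers_per_procs</span></code> workers and <code class="docutils literal"><span class="pre">nservers_per_procs</span></code> servers, and workers and servers share processes when <code class="docutils literal"><span class="pre">server_worker_separate</span></code> is false. The helper <code class="docutils literal"><span class="pre">count_processes</span></code> is hypothetical and is not a SINGA API.</p>

```python
def count_processes(cfg):
    """Estimate the number of processes a cluster config would launch.

    Assumption (not stated on this page): a process hosts up to
    nworkers_per_procs workers and nservers_per_procs servers, and
    co-located workers and servers share the same processes.
    """
    workers = cfg["nworker_groups"] * cfg["nworkers_per_group"]
    servers = cfg["nserver_groups"] * cfg["nservers_per_group"]
    # Ceiling division; both per-process fields default to 1 in ClusterProto.
    worker_procs = -(-workers // cfg.get("nworkers_per_procs", 1))
    server_procs = -(-servers // cfg.get("nservers_per_procs", 1))
    if cfg["server_worker_separate"]:
        return worker_procs + server_procs
    return max(worker_procs, server_procs)


# Field values copied from the four cluster blocks on this page.
FRAMEWORKS = {
    "Sandblaster": dict(nworker_groups=1, nserver_groups=1,
                        nworkers_per_group=3, nservers_per_group=2,
                        server_worker_separate=True),
    "AllReduce": dict(nworker_groups=1, nserver_groups=1,
                      nworkers_per_group=3, nservers_per_group=3,
                      server_worker_separate=False),
    "Downpour": dict(nworker_groups=2, nserver_groups=1,
                     nworkers_per_group=2, nservers_per_group=2,
                     server_worker_separate=True),
    "Distributed Hogwild": dict(nworker_groups=3, nserver_groups=3,
                                nworkers_per_group=1, nservers_per_group=1,
                                server_worker_separate=False),
}

for name, cfg in FRAMEWORKS.items():
    print(name, count_processes(cfg))
```

<p>Under this assumption, AllReduce and Distributed Hogwild each come out at three processes, which is consistent with the descriptions of one worker and one server sharing a node.</p>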
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016 The Apache Software Foundation. All rights reserved. Apache Singa, Apache, the Apache feather logo, and the Apache Singa project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'0.3.0',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.StickyNav.enable();
});
</script>
<div class="rst-versions shift-up" data-toggle="rst-versions" role="note" aria-label="versions">
<img src="../_static/apache.jpg">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> incubator-singa </span>
v: 0.3.0
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
<dl>
<dt>Languages</dt>
<dd><a href="../../en/index.html">English</a></dd>
<dd><a href="../../zh/index.html">中文</a></dd>
<dd><a href="../../jp/index.html">日本語</a></dd>
<dd><a href="../../kr/index.html">한국어</a></dd>
</dl>
</div>
</div>
<a href="https://github.com/apache/incubator-singa">
<img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
src="https://s3.amazonaws.com/github/ribbons/forkme_right_orange_ff7600.png"
alt="Fork me on GitHub">
</a>
</body>
</html>