<!DOCTYPE html>
<!--[if IE 8]><html class="no-js lt-ie9" lang="en" > <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js" lang="en" > <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Start HDFS and Mesos &mdash; incubator-singa 0.3.0 documentation</title>
<link rel="stylesheet" href="../_static/css/theme.css" type="text/css" />
<link rel="top" title="incubator-singa 0.3.0 documentation" href="../index.html"/>
<script src="../_static/js/modernizr.min.js"></script>
</head>
<body class="wy-body-for-nav" role="document">
<div class="wy-grid-for-nav">
<nav data-toggle="wy-nav-shift" class="wy-nav-side">
<div class="wy-side-scroll">
<div class="wy-side-nav-search">
<a href="../index.html" class="icon icon-home"> incubator-singa
<img src="../_static/singa.png" class="logo" />
</a>
<div class="version">
0.3.0
</div>
<div role="search">
<form id="rtd-search-form" class="wy-form" action="../search.html" method="get">
<input type="text" name="q" placeholder="Search docs" />
<input type="hidden" name="check_keywords" value="yes" />
<input type="hidden" name="area" value="default" />
</form>
</div>
</div>
<div class="wy-menu wy-menu-vertical" data-spy="affix" role="navigation" aria-label="main navigation">
<ul>
<li class="toctree-l1"><a class="reference internal" href="../downloads.html">Download SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="index.html">Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Development</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../develop/schedule.html">Development Schedule</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/how-contribute.html">How to Contribute to SINGA</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-code.html">How to Contribute Code</a></li>
<li class="toctree-l1"><a class="reference internal" href="../develop/contribute-docs.html">How to Contribute Documentation</a></li>
</ul>
<p class="caption"><span class="caption-text">Community</span></p>
<ul>
<li class="toctree-l1"><a class="reference internal" href="../community/source-repository.html">Source Repository</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/mail-lists.html">Project Mailing Lists</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/issue-tracking.html">Issue Tracking</a></li>
<li class="toctree-l1"><a class="reference internal" href="../community/team-list.html">The SINGA Team</a></li>
</ul>
</div>
</div>
</nav>
<section data-toggle="wy-nav-shift" class="wy-nav-content-wrap">
<nav class="wy-nav-top" role="navigation" aria-label="top navigation">
<i data-toggle="wy-nav-top" class="fa fa-bars"></i>
<a href="../index.html">incubator-singa</a>
</nav>
<div class="wy-nav-content">
<div class="rst-content">
<div role="main" class="document" itemscope="itemscope" itemtype="http://schema.org/Article">
<div itemprop="articleBody">
<h1>Distributed Training on Mesos</h1>
<p>This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and that every node has SINGA installed.
We assume the architecture depicted below, in which the cluster nodes are Docker containers. Refer to the <a class="reference external" href="docker.html">Docker guide</a> for details on how to start individual nodes and set up network connections between them (make sure <a class="reference external" href="http://weave.works/guides/weave-docker-ubuntu-simple.html">weave</a> is running on each node, and that the cluster&#8217;s head node is running in container <code class="docutils literal"><span class="pre">node0</span></code>).</p>
<p><img alt="Architecture of SINGA on a Mesos cluster of Docker containers" src="http://www.comp.nus.edu.sg/~dinhtta/files/singa_mesos.png" /></p>
<hr class="docutils" />
<div class="section" id="start-hdfs-and-mesos">
<span id="start-hdfs-and-mesos"></span><h1>Start HDFS and Mesos<a class="headerlink" href="#start-hdfs-and-mesos" title="Permalink to this headline"></a></h1>
<p>Open a shell inside each container using:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">docker</span> <span class="n">exec</span> <span class="o">-</span><span class="n">it</span> <span class="n">nodeX</span> <span class="o">/</span><span class="nb">bin</span><span class="o">/</span><span class="n">bash</span>
</pre></div>
</div>
<p>and configure it as follows:</p>
<ul>
<li><p class="first">On container <code class="docutils literal"><span class="pre">node0</span></code></p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">hadoop</span> <span class="n">namenode</span> <span class="o">-</span><span class="nb">format</span>
<span class="n">hadoop</span><span class="o">-</span><span class="n">daemon</span><span class="o">.</span><span class="n">sh</span> <span class="n">start</span> <span class="n">namenode</span>
<span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">mesos</span><span class="o">-</span><span class="mf">0.22</span><span class="o">.</span><span class="mi">0</span><span class="o">/</span><span class="n">build</span><span class="o">/</span><span class="nb">bin</span><span class="o">/</span><span class="n">mesos</span><span class="o">-</span><span class="n">master</span><span class="o">.</span><span class="n">sh</span> <span class="o">--</span><span class="n">work_dir</span><span class="o">=/</span><span class="n">opt</span> <span class="o">--</span><span class="n">log_dir</span><span class="o">=/</span><span class="n">opt</span> <span class="o">--</span><span class="n">quiet</span> <span class="o">&gt;</span> <span class="o">/</span><span class="n">dev</span><span class="o">/</span><span class="n">null</span> <span class="o">&amp;</span>
<span class="n">zk</span><span class="o">-</span><span class="n">service</span><span class="o">.</span><span class="n">sh</span> <span class="n">start</span>
</pre></div>
</div>
</li>
<li><p class="first">On container <code class="docutils literal"><span class="pre">node1,</span> <span class="pre">node2,</span> <span class="pre">...</span></code></p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">hadoop</span><span class="o">-</span><span class="n">daemon</span><span class="o">.</span><span class="n">sh</span> <span class="n">start</span> <span class="n">datanode</span>
<span class="o">/</span><span class="n">opt</span><span class="o">/</span><span class="n">mesos</span><span class="o">-</span><span class="mf">0.22</span><span class="o">.</span><span class="mi">0</span><span class="o">/</span><span class="n">build</span><span class="o">/</span><span class="nb">bin</span><span class="o">/</span><span class="n">mesos</span><span class="o">-</span><span class="n">slave</span><span class="o">.</span><span class="n">sh</span> <span class="o">--</span><span class="n">master</span><span class="o">=</span><span class="n">node0</span><span class="p">:</span><span class="mi">5050</span> <span class="o">--</span><span class="n">hostname</span><span class="o">=</span><span class="n">XX</span><span class="o">.</span><span class="n">XX</span><span class="o">.</span><span class="n">XX</span><span class="o">.</span><span class="n">XX</span> <span class="o">--</span><span class="n">log_dir</span><span class="o">=/</span><span class="n">opt</span> <span class="o">--</span><span class="n">quiet</span> <span class="o">&gt;</span> <span class="o">/</span><span class="n">dev</span><span class="o">/</span><span class="n">null</span> <span class="o">&amp;</span>
</pre></div>
</div>
<p>where <code class="docutils literal"><span class="pre">XX.XX.XX.XX</span></code> is the <strong>public IP address</strong> of the slave node.</p>
</li>
</ul>
<p>To verify the setup, check that the HDFS namenode has registered all <code class="docutils literal"><span class="pre">N</span></code> datanodes:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span><span class="n">hadoop</span> <span class="n">dfsadmin</span> <span class="o">-</span><span class="n">report</span>
</pre></div>
</div>
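<p>When the check has to be scripted, the report can be parsed instead of read by eye. The following is a minimal Python sketch, not part of SINGA. It assumes that each registered datanode appears in the report as a block whose first line starts with <code class="docutils literal"><span class="pre">Name:</span></code>, which may vary between Hadoop versions. The function name <code class="docutils literal"><span class="pre">count_datanodes</span></code> and the sample report text are hypothetical.</p>
<div class="highlight-default"><div class="highlight"><pre>import re


def count_datanodes(report_text):
    """Count datanode entries in the text printed by `hadoop dfsadmin -report`.

    Assumption: every datanode entry begins with a line starting with "Name:".
    Adjust the pattern if your Hadoop version prints a different layout.
    """
    return len(re.findall(r"^Name:", report_text, flags=re.MULTILINE))


# Hypothetical, abbreviated report used only to exercise the function.
sample = """Datanodes available: 2 (2 total, 0 dead)

Name: 10.0.0.2:50010
Decommission Status : Normal

Name: 10.0.0.3:50010
Decommission Status : Normal
"""

print(count_datanodes(sample))  # prints 2
</pre></div>
</div>
<p>In practice the report text would come from running the command above and capturing its standard output; compare the returned count with the number of slave containers you started.</p>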
<p><strong>Important.</strong> If the Docker version is 1.9 or newer, make sure <a class="reference external" href="docker.html#launch_pseudo">name resolution is set up
properly</a>.</p>
<div class="section" id="mesos-logs">
<span id="mesos-logs"></span><h2>Mesos logs<a class="headerlink" href="#mesos-logs" title="Permalink to this headline"></a></h2>
<p>Mesos logs are stored at <code class="docutils literal"><span class="pre">/opt/lt-mesos-master.INFO</span></code> on <code class="docutils literal"><span class="pre">node0</span></code> and at <code class="docutils literal"><span class="pre">/opt/lt-mesos-slave.INFO</span></code> on the other nodes.</p>
</div>
</div>
<hr class="docutils" />
<div class="section" id="starting-singa-training-on-mesos">
<span id="starting-singa-training-on-mesos"></span><h1>Starting SINGA training on Mesos<a class="headerlink" href="#starting-singa-training-on-mesos" title="Permalink to this headline"></a></h1>
<p>Assuming that Mesos and HDFS are already running, a SINGA job can be launched from <strong>any</strong> container.</p>
<div class="section" id="launching-job">
<span id="launching-job"></span><h2>Launching job<a class="headerlink" href="#launching-job" title="Permalink to this headline"></a></h2>
<ol>
<li><p class="first">Log in to any container, then go to <code class="docutils literal"><span class="pre">incubator-singa/tool/mesos</span></code>
<a name="job_start"></a></p>
</li>
<li><p class="first">Check that configuration files are correct:</p>
<ul class="simple">
<li><code class="docutils literal"><span class="pre">scheduler.conf</span></code> contains information about the master nodes.</li>
<li><code class="docutils literal"><span class="pre">singa.conf</span></code> contains information about the Zookeeper node (<code class="docutils literal"><span class="pre">node0</span></code>).</li>
<li>The job configuration file <code class="docutils literal"><span class="pre">job.conf</span></code> <strong>must contain full paths to the examples directories (no relative paths).</strong></li>
</ul>
</li>
<li><p class="first">Start the job:</p>
<ul>
<li><p class="first">If starting for the first time:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">make</span>
<span class="o">./</span><span class="n">scheduler</span> <span class="o">&lt;</span><span class="n">job</span> <span class="n">config</span> <span class="n">file</span><span class="o">&gt;</span> <span class="o">-</span><span class="n">scheduler_conf</span> <span class="o">&lt;</span><span class="n">scheduler</span> <span class="n">config</span> <span class="n">file</span><span class="o">&gt;</span> <span class="o">-</span><span class="n">singa_conf</span> <span class="o">&lt;</span><span class="n">SINGA</span> <span class="n">config</span> <span class="n">file</span><span class="o">&gt;</span>
</pre></div>
</div>
</li>
<li><p class="first">If not the first time:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="o">./</span><span class="n">scheduler</span> <span class="o">&lt;</span><span class="n">job</span> <span class="n">config</span> <span class="n">file</span><span class="o">&gt;</span>
</pre></div>
</div>
</li>
</ul>
</li>
</ol>
<p><strong>Note.</strong> Each running job is given a <code class="docutils literal"><span class="pre">frameworkID</span></code>. Look for a log message of the form:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">Framework</span> <span class="n">registered</span> <span class="k">with</span> <span class="n">XXX</span><span class="o">-</span><span class="n">XXX</span><span class="o">-</span><span class="n">XXX</span><span class="o">-</span><span class="n">XXX</span><span class="o">-</span><span class="n">XXX</span><span class="o">-</span><span class="n">XXX</span>
</pre></div>
</div>
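<p>The ID is needed later to locate the job's sandbox and to stop the job, so it can be convenient to pull it out of the scheduler output automatically. The snippet below is a small Python sketch under the assumption that the message has exactly the form shown above; the function name <code class="docutils literal"><span class="pre">find_framework_id</span></code> and the sample log line are hypothetical.</p>
<div class="highlight-default"><div class="highlight"><pre>import re


def find_framework_id(log_text):
    """Return the ID that follows "Framework registered with", or None."""
    match = re.search(r"Framework registered with (\S+)", log_text)
    return match.group(1) if match else None


# Hypothetical log line; a real ID is assigned by the Mesos master.
line = "Framework registered with XXX-XXX-XXX-XXX-XXX-XXX"
print(find_framework_id(line))  # prints XXX-XXX-XXX-XXX-XXX-XXX
</pre></div>
</div>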
</div>
<div class="section" id="monitoring-and-debugging">
<span id="monitoring-and-debugging"></span><h2>Monitoring and Debugging<a class="headerlink" href="#monitoring-and-debugging" title="Permalink to this headline"></a></h2>
<p>Each Mesos job is given a <code class="docutils literal"><span class="pre">frameworkID</span></code>, and a <em>sandbox</em> directory is created for each job.
The directory is located under the specified <code class="docutils literal"><span class="pre">work_dir</span></code> (<code class="docutils literal"><span class="pre">/tmp/mesos</span></code> by default). For example, errors
raised during SINGA execution can be found at:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="o">/</span><span class="n">tmp</span><span class="o">/</span><span class="n">mesos</span><span class="o">/</span><span class="n">slaves</span><span class="o">/</span><span class="n">xxxxx</span><span class="o">-</span><span class="n">Sx</span><span class="o">/</span><span class="n">frameworks</span><span class="o">/</span><span class="n">xxxxx</span><span class="o">/</span><span class="n">executors</span><span class="o">/</span><span class="n">SINGA_x</span><span class="o">/</span><span class="n">runs</span><span class="o">/</span><span class="n">latest</span><span class="o">/</span><span class="n">stderr</span>
</pre></div>
</div>
<p>Other artifacts, such as files downloaded from HDFS (<code class="docutils literal"><span class="pre">job.conf</span></code>) and <code class="docutils literal"><span class="pre">stdout</span></code>, can be found in the same
directory.</p>
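<p>The sandbox path follows a fixed pattern, so it can be assembled from its parts. The following Python sketch mirrors the layout shown above; it is an illustration rather than a Mesos API. The function name <code class="docutils literal"><span class="pre">sandbox_stderr_path</span></code> and the example IDs are hypothetical, and the default <code class="docutils literal"><span class="pre">work_dir</span></code> is the <code class="docutils literal"><span class="pre">/tmp/mesos</span></code> value mentioned above.</p>
<div class="highlight-default"><div class="highlight"><pre>import posixpath


def sandbox_stderr_path(slave_id, framework_id, executor_id, work_dir="/tmp/mesos"):
    """Build the path of the stderr file for the latest run of an executor."""
    return posixpath.join(
        work_dir,
        "slaves", slave_id,
        "frameworks", framework_id,
        "executors", executor_id,
        "runs", "latest", "stderr",
    )


# Hypothetical IDs; real ones are assigned by Mesos when the job registers.
print(sandbox_stderr_path("abc-S0", "abc-0001", "SINGA_0"))
</pre></div>
</div>
<p>Replacing the final component with <code class="docutils literal"><span class="pre">stdout</span></code> gives the location of the standard output file in the same directory.</p>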
</div>
<div class="section" id="stopping">
<span id="stopping"></span><h2>Stopping<a class="headerlink" href="#stopping" title="Permalink to this headline"></a></h2>
<p>There are two ways to kill a running job:</p>
<ol>
<li><p class="first">If the scheduler is running in the foreground, simply kill it (using <code class="docutils literal"><span class="pre">Ctrl-C</span></code>, for example).</p>
</li>
<li><p class="first">If the scheduler is running in the background, kill it using Mesos&#8217;s REST API:</p>
<div class="highlight-default"><div class="highlight"><pre><span></span> <span class="n">curl</span> <span class="o">-</span><span class="n">d</span> <span class="s2">&quot;frameworkId=XXX-XXX-XXX-XXX-XXX-XXX&quot;</span> <span class="o">-</span><span class="n">X</span> <span class="n">POST</span> <span class="n">http</span><span class="p">:</span><span class="o">//&lt;</span><span class="n">master</span><span class="o">&gt;/</span><span class="n">master</span><span class="o">/</span><span class="n">shutdown</span>
</pre></div>
</div>
</li>
</ol>
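<p>The same request can be prepared from a script. The Python sketch below only builds the POST request that corresponds to the <code class="docutils literal"><span class="pre">curl</span></code> command above and does not send it. The function name <code class="docutils literal"><span class="pre">build_shutdown_request</span></code> is hypothetical, and the master address <code class="docutils literal"><span class="pre">node0:5050</span></code> is assumed from the <code class="docutils literal"><span class="pre">--master</span></code> value used when starting the slaves.</p>
<div class="highlight-default"><div class="highlight"><pre>import urllib.parse
import urllib.request


def build_shutdown_request(master, framework_id):
    """Prepare a POST to the master's shutdown endpoint for one framework."""
    body = urllib.parse.urlencode({"frameworkId": framework_id}).encode("ascii")
    url = "http://" + master + "/master/shutdown"
    return urllib.request.Request(url, data=body, method="POST")


# Placeholder ID; substitute the frameworkID reported when the job registered.
req = build_shutdown_request("node0:5050", "XXX-XXX-XXX-XXX-XXX-XXX")
print(req.get_method(), req.full_url)
</pre></div>
</div>
<p>Passing the prepared request to <code class="docutils literal"><span class="pre">urllib.request.urlopen</span></code> would actually submit it, so do that only once the ID has been checked.</p>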
</div>
</div>
</div>
</div>
<footer>
<hr/>
<div role="contentinfo">
<p>
&copy; Copyright 2016 The Apache Software Foundation. All rights reserved. Apache Singa, Apache, the Apache feather logo, and the Apache Singa project logos are trademarks of The Apache Software Foundation. All other marks mentioned may be trademarks or registered trademarks of their respective owners.
</p>
</div>
Built with <a href="http://sphinx-doc.org/">Sphinx</a> using a <a href="https://github.com/snide/sphinx_rtd_theme">theme</a> provided by <a href="https://readthedocs.org">Read the Docs</a>.
</footer>
</div>
</div>
</section>
</div>
<script type="text/javascript">
var DOCUMENTATION_OPTIONS = {
URL_ROOT:'../',
VERSION:'0.3.0',
COLLAPSE_INDEX:false,
FILE_SUFFIX:'.html',
HAS_SOURCE: true
};
</script>
<script type="text/javascript" src="../_static/jquery.js"></script>
<script type="text/javascript" src="../_static/underscore.js"></script>
<script type="text/javascript" src="../_static/doctools.js"></script>
<script type="text/javascript" src="../_static/js/theme.js"></script>
<script type="text/javascript">
jQuery(function () {
SphinxRtdTheme.StickyNav.enable();
});
</script>
<div class="rst-versions shift-up" data-toggle="rst-versions" role="note" aria-label="versions">
<img src="../_static/apache.jpg">
<span class="rst-current-version" data-toggle="rst-current-version">
<span class="fa fa-book"> incubator-singa </span>
v: 0.3.0
<span class="fa fa-caret-down"></span>
</span>
<div class="rst-other-versions">
<dl>
<dt>Languages</dt>
<dd><a href="../../en/index.html">English</a></dd>
<dd><a href="../../zh/index.html">中文</a></dd>
<dd><a href="../../jp/index.html">日本語</a></dd>
<dd><a href="../../kr/index.html">한국어</a></dd>
</dl>
</div>
</div>
<a href="https://github.com/apache/incubator-singa">
<img style="position: absolute; top: 0; right: 0; border: 0; z-index: 10000;"
src="https://s3.amazonaws.com/github/ribbons/forkme_right_orange_ff7600.png"
alt="Fork me on GitHub">
</a>
</body>
</html>