 # Distributed Training on Mesos

 This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and that every node has SINGA installed.
 We assume the architecture depicted below, in which the cluster nodes are Docker containers. Refer to the [Docker guide](docker.html) for details on how to start individual nodes and set up the network connection between them (make sure [weave](http://weave.works/guides/weave-docker-ubuntu-simple.html) is running on each node, and that the cluster's head node is running in container `node0`).

 ![Nothing](http://www.comp.nus.edu.sg/~dinhtta/files/singa_mesos.png)

 ---

 ## Start HDFS and Mesos
 Go inside each container, using:
 ````
 docker exec -it nodeX /bin/bash
 ````
 and configure it as follows:

 * On container `node0`

         hadoop namenode -format
         hadoop-daemon.sh start namenode
         /opt/mesos-0.22.0/build/bin/mesos-master.sh --work_dir=/opt --log_dir=/opt --quiet > /dev/null &
         zk-service.sh start

 * On container `node1, node2, ...`

         hadoop-daemon.sh start datanode
         /opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --hostname=XX.XX.XX.XX --log_dir=/opt --quiet > /dev/null &

   where `XX.XX.XX.XX` is the **public IP address** of the slave node.

 To check whether the setup succeeded, verify that the HDFS namenode has registered `N` datanodes via:

 ````
 hadoop dfsadmin -report
 ````
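
The check can also be scripted. The following is a minimal sketch, assuming that each datanode entry in the report begins with a `Name:` line; verify this against the output of your Hadoop version, which may list dead nodes the same way. The helper name `count_datanodes` is hypothetical and not part of SINGA or Hadoop.

```shell
# Count datanode entries in `hadoop dfsadmin -report` output read from stdin.
# Assumption: every datanode entry starts with a line beginning "Name:".
count_datanodes() {
    # grep -c prints the number of matching lines. It exits non-zero when
    # nothing matches, so `|| true` keeps the function from failing on 0.
    grep -c '^Name:' || true
}
```

For example, `hadoop dfsadmin -report | count_datanodes` should print `N` once all datanodes have registered.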

 #### Important
 If the Docker version is 1.9 or newer, make sure [name resolution is set up properly](docker.html#launch_pseudo).

 #### Mesos logs
 Mesos logs are stored at `/opt/lt-mesos-master.INFO` on `node0` and at `/opt/lt-mesos-slave.INFO` on the other nodes.
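
These files are written in the glog format, in which each log line starts with a severity letter (`I`, `W`, `E`, or `F`) followed by the date. Under that assumption, warnings and errors can be filtered out of the INFO log with a small helper. This is only a sketch, and `mesos_problems` is a hypothetical name:

```shell
# Print only warning, error, and fatal lines from a glog-style log on stdin.
# Assumption: lines look like "W0527 12:00:00.123456  1234 file.cpp:42] message".
mesos_problems() {
    # `|| true` keeps the function from failing when the log has no such lines.
    grep -E '^[WEF][0-9]{4} ' || true
}
```

On `node0`, for instance, it could be used as `mesos_problems < /opt/lt-mesos-master.INFO`.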

 ---

 ## Starting SINGA training on Mesos
 Assuming that Mesos and HDFS are already running, a SINGA job can be launched from **any** container.

 #### Launching job

 1. Log in to any container, then go to `incubator-singa/tool/mesos`
 <a name="job_start"></a>
 2. Check that configuration files are correct:
     + `scheduler.conf` contains information about the master nodes
     + `singa.conf` contains information about the Zookeeper node (`node0`)
     + The job configuration file `job.conf` **contains the full paths to the examples directories (NO RELATIVE PATHS!).**
 3. Start the job:
     + If starting for the first time:

               make
               ./scheduler <job config file> -scheduler_conf <scheduler config file> -singa_conf <SINGA config file>
     + If not the first time:

               ./scheduler <job config file>

 **Note.** Each running job is given a `frameworkID`. To find it, look for a log message of the form:

              Framework registered with XXX-XXX-XXX-XXX-XXX-XXX
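
If the scheduler output is saved to a file, the ID can be pulled out of it automatically. The sketch below assumes that the ID is the first whitespace-delimited token after `Framework registered with`; the helper name `extract_framework_id` and the log file name used in the usage note are hypothetical.

```shell
# Extract the framework ID from scheduler log lines read on stdin.
# Assumption: the ID is the token that follows "Framework registered with ".
extract_framework_id() {
    # Print the captured ID for matching lines only, and keep the first match.
    sed -n 's/^.*Framework registered with \([^[:space:]]*\).*$/\1/p' | head -n 1
}
```

For example, after running `./scheduler <job config file> 2>&1 | tee scheduler.log`, the ID can be read back with `extract_framework_id < scheduler.log`.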

 #### Monitoring and Debugging

 Each Mesos job is given a `frameworkID`, and a *sandbox* directory is created for each job.
 The directory is located under the specified `work_dir` (`/tmp/mesos` by default). For example, the error
 output of a SINGA execution can be found at:

             /tmp/mesos/slaves/xxxxx-Sx/frameworks/xxxxx/executors/SINGA_x/runs/latest/stderr

 Other artifacts, such as files downloaded from HDFS (`job.conf`) and `stdout`, can be found in the same
 directory.
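
Because the slave, framework, and executor IDs in that path differ for every job, it can be convenient to let the shell find the files. The following is a minimal sketch, assuming the sandbox layout matches the example path above; `list_sandbox_stderr` is a hypothetical helper name.

```shell
# List the stderr file of the latest run of every executor under a Mesos work_dir.
# Assumption: the sandbox layout matches the example path shown in this guide.
list_sandbox_stderr() {
    work_dir="${1:-/tmp/mesos}"   # default work_dir named in this guide
    for f in "$work_dir"/slaves/*/frameworks/*/executors/*/runs/latest/stderr; do
        # An unmatched glob stays literal, so print only paths that exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}
```

Run without arguments on a slave node, it prints one path per executor. Pass a directory as the first argument if the slave was started with a different `work_dir`.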

 #### Stopping

 There are two ways to kill a running job:

 1. If the scheduler is running in the foreground, simply kill it (using `Ctrl-C`, for example).

 2. If the scheduler is running in the background, kill it using Mesos's REST API:

           curl -d "frameworkId=XXX-XXX-XXX-XXX-XXX-XXX" -X POST http://<master>/master/shutdown
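
The request above can be wrapped in a small function so that the master address and the framework ID are passed as arguments. This is only a sketch: `shutdown_framework` and the `DRY_RUN` variable are hypothetical names introduced here, and the endpoint path is copied from the command above. Other Mesos releases may expose the endpoint under a different path, so check the documentation for your version.

```shell
# Ask the Mesos master to shut down a framework.
# Arguments: <master host:port> <frameworkID>
# Set DRY_RUN=1 to print the curl command instead of running it.
shutdown_framework() {
    if [ "$#" -ne 2 ]; then
        echo "usage: shutdown_framework <master host:port> <frameworkID>" >&2
        return 1
    fi
    url="http://$1/master/shutdown"
    data="frameworkId=$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be sent without contacting the master.
        printf 'curl -d "%s" -X POST %s\n' "$data" "$url"
    else
        curl -d "$data" -X POST "$url"
    fi
}
```

With the setup in this guide, where the slaves connect to the master at `node0:5050`, a call might look like `shutdown_framework node0:5050 <frameworkID>`. Running it first with `DRY_RUN=1` shows the exact request.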