# Distributed Training on Mesos
This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and every node has SINGA installed.
We assume the architecture depicted below, in which the cluster nodes are Docker containers. Refer to the [Docker guide](docker.html) for details on how to start individual nodes and set up network connections between them (make sure [weave](http://weave.works/guides/weave-docker-ubuntu-simple.html) is running on each node, and that the cluster's head node is running in container `node0`).
![Nothing](http://www.comp.nus.edu.sg/~dinhtta/files/singa_mesos.png)
---
## Start HDFS and Mesos
Go inside each container, using:
````
docker exec -it nodeX /bin/bash
````
and configure it as follows:
* On container `node0`

        hadoop namenode -format
        hadoop-daemon.sh start namenode
        /opt/mesos-0.22.0/build/bin/mesos-master.sh --work_dir=/opt --log_dir=/opt --quiet > /dev/null &
        zk-service.sh start

* On containers `node1`, `node2`, ...

        hadoop-daemon.sh start datanode
        /opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --hostname=XX.XX.XX.XX --log_dir=/opt --quiet > /dev/null &

    where XX.XX.XX.XX is the **public IP address** of the slave node.
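
The slave command differs between nodes only in its `--hostname` value. As a minimal sketch, the helper below prints the command for a given public IP so that it can be reviewed before being run inside the matching container. The function name `mesos_slave_cmd` and the sample address are hypothetical; the command line itself is copied from the step above.

```shell
# Hypothetical helper: print the slave start-up command for one public IP.
# Nothing is executed here; the command text mirrors the step above.
mesos_slave_cmd() {
    public_ip="$1"
    printf '%s\n' "/opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --hostname=${public_ip} --log_dir=/opt --quiet > /dev/null &"
}

# Example with an illustrative address (203.0.113.7 is a documentation-only IP).
mesos_slave_cmd 203.0.113.7
```
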

To check whether the setup succeeded, verify that the HDFS namenode has registered `N` datanodes, via:
````
hadoop dfsadmin -report
````
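
To compare the report against the expected `N`, the datanode entries can be counted. The sketch below assumes that each datanode entry in the report starts with a `Name:` line; this may vary across Hadoop versions, and some versions may also list dead nodes this way. `count_datanodes` is a hypothetical helper name.

```shell
# Hypothetical helper: count datanode entries in a dfsadmin report read from stdin.
# Assumption: every datanode entry begins with a line starting with "Name: ".
count_datanodes() {
    # grep -c exits non-zero when the count is 0, so mask the exit status.
    grep -c '^Name: ' || true
}

# Usage on node0 (requires a running HDFS):
#   hadoop dfsadmin -report | count_datanodes
```
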
#### Important
If the Docker version is 1.9 or newer, make sure [name resolution is set up properly](docker.html#launch_pseudo).
#### Mesos logs
Mesos logs are stored at `/opt/lt-mesos-master.INFO` on `node0` and at `/opt/lt-mesos-slave.INFO` on the other nodes.
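
Mesos writes these files with glog, whose lines begin with a severity letter (`I`, `W`, `E` or `F`) followed by the date. The following is a minimal sketch for pulling out only warnings and errors; `mesos_log_problems` is a hypothetical helper name that takes the log path as its argument.

```shell
# Hypothetical helper: print warning, error and fatal lines from a glog-style file.
mesos_log_problems() {
    log_file="$1"
    # Severity letter followed by the 4-digit mmdd date, e.g. "W0512 ..." or "E0512 ...".
    # grep exits non-zero when nothing matches, so mask the exit status.
    grep -E '^[WEF][0-9]{4} ' "$log_file" || true
}

# Usage on node0:
#   mesos_log_problems /opt/lt-mesos-master.INFO
```
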
---
## Starting SINGA training on Mesos
Assuming that Mesos and HDFS are already running, a SINGA job can be launched from **any** container.
#### Launching job
1. Log in to any container, then go to `incubator-singa/tool/mesos`.
<a name="job_start"></a>
2. Check that the configuration files are correct:
    + `scheduler.conf` contains information about the master nodes
    + `singa.conf` contains information about the Zookeeper node (`node0`)
    + The job configuration file `job.conf` **must contain full paths to the example directories (no relative paths).**
3. Start the job:
    + If starting for the first time:

            make
            ./scheduler <job config file> -scheduler_conf <scheduler config file> -singa_conf <SINGA config file>

    + If not the first time:

            ./scheduler <job config file>

**Notes.** Each running job is given a `frameworkID`. Look for a log message of the form:

    Framework registered with XXX-XXX-XXX-XXX-XXX-XXX
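
Since the `frameworkID` is needed later to locate the sandbox and to stop a background job, it can help to capture it from the scheduler output. This is a minimal sketch, assuming the scheduler output was saved to a file (the name `scheduler.log` is hypothetical) and that the ID directly follows the phrase in the message above. `extract_framework_id` is a hypothetical helper name.

```shell
# Hypothetical helper: read scheduler output on stdin and print the framework ID
# taken from the first "Framework registered with <ID>" line.
extract_framework_id() {
    sed -n 's/.*Framework registered with \([^[:space:]]*\).*/\1/p' | head -n 1
}

# Usage, assuming the output was redirected to scheduler.log:
#   ./scheduler <job config file> > scheduler.log 2>&1 &
#   extract_framework_id < scheduler.log
```
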
#### Monitoring and Debugging
Each Mesos job is given a `frameworkID`, and a *sandbox* directory is created for each job.
The sandbox is located under the specified `work_dir` (`/tmp/mesos` by default). For example, errors
during SINGA execution can be found at:

    /tmp/mesos/slaves/xxxxx-Sx/frameworks/xxxxx/executors/SINGA_x/runs/latest/stderr

Other artifacts, such as files downloaded from HDFS (`job.conf`) and `stdout`, can be found in the same
directory.
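
Because the slave ID, framework ID and executor name differ from run to run, a shell glob is a convenient way to locate these files. The sketch below lists the latest `stderr` of every SINGA executor under a work directory. `list_singa_stderr` is a hypothetical helper name, and the directory layout is taken from the example path above.

```shell
# Hypothetical helper: list the latest stderr file of each SINGA executor.
# $1 is the Mesos work directory (defaults to /tmp/mesos).
list_singa_stderr() {
    work_dir="${1:-/tmp/mesos}"
    for f in "$work_dir"/slaves/*/frameworks/*/executors/SINGA_*/runs/latest/stderr; do
        # An unmatched glob stays literal, so keep only paths that exist.
        if [ -e "$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage on a slave node:
#   list_singa_stderr /tmp/mesos
```
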
#### Stopping
There are two ways to kill a running job:
1. If the scheduler is running in the foreground, simply kill it (using `Ctrl-C`, for example).
2. If the scheduler is running in the background, kill it using Mesos's REST API:

        curl -d "frameworkId=XXX-XXX-XXX-XXX-XXX-XXX" -X POST http://<master>/master/shutdown
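
The REST call can be wrapped in a small function so that the master address and framework ID are passed as arguments. This is a sketch only: `shutdown_framework` and the `DRY_RUN` variable are hypothetical names introduced here, while the `curl` invocation is the one shown above.

```shell
# Hypothetical helper: ask the Mesos master to shut down one framework.
# $1 is the master address (for example node0:5050), $2 is the framework ID.
# With DRY_RUN=1 the command is only printed, which is useful for checking it first.
shutdown_framework() {
    master="$1"
    framework_id="$2"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'curl -d "frameworkId=%s" -X POST http://%s/master/shutdown\n' "$framework_id" "$master"
    else
        curl -d "frameworkId=${framework_id}" -X POST "http://${master}/master/shutdown"
    fi
}

# Usage:
#   DRY_RUN=1 shutdown_framework node0:5050 XXX-XXX-XXX-XXX-XXX-XXX
#   shutdown_framework node0:5050 XXX-XXX-XXX-XXX-XXX-XXX
```
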