
This guide explains how to start SINGA distributed training on a Mesos cluster. It assumes that both Mesos and HDFS are already running, and every node has SINGA installed.

This guide assumes an architecture in which the cluster is set up from a set of containers (if the cluster is already set up, you can start the training immediately). Refer to the Docker guide for details on how to start individual nodes and set up the network connections between them. Make sure weave is running on each node and that the cluster's head node is running in container node0.

Start HDFS and Mesos

Go inside each container, using:

docker exec -it nodeX /bin/bash

and configure it as follows:

On container node0

  hadoop namenode -format
  hadoop-daemon.sh start namenode
  /opt/mesos-0.22.0/build/bin/mesos-master.sh --work_dir=/opt --log_dir=/opt --quiet > /dev/null &
  zk-service.sh start

On containers node1, node2, ...

  hadoop-daemon.sh start datanode
  /opt/mesos-0.22.0/build/bin/mesos-slave.sh --master=node0:5050 --log_dir=/opt --quiet > /dev/null &

To verify the setup, check that the HDFS namenode has registered N datanodes, where N is the number of datanode containers started above:

hadoop dfsadmin -report
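The check above can be scripted. The following is a minimal sketch, not part of SINGA or Hadoop: `count_datanodes` is a hypothetical helper, and it assumes that each datanode entry in the report begins with a `Name:` line. Confirm this against the output of your Hadoop version before relying on it.

```shell
# Count the datanode entries in a `hadoop dfsadmin -report` listing read from stdin.
# Assumption: every datanode entry starts with a line beginning "Name:".
count_datanodes() {
    grep -c '^Name:'
}
```

It would be used as `hadoop dfsadmin -report | count_datanodes`, and the printed number compared with N. Note that the count includes every listed datanode, so the summary at the top of the report is still worth reading if a node may be dead.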

Mesos logs

Mesos logs are stored at /opt/lt-mesos-master.INFO on node0 and at /opt/lt-mesos-slave.INFO on the other nodes.

Starting SINGA training on Mesos

Assuming that Mesos and HDFS are already running, a SINGA job can be launched from any container.

Launching job

2. Check that configuration files are correct:

scheduler.conf contains information about the master nodes
singa.conf contains information about the Zookeeper node
The job configuration file job.conf contains the full paths to the examples directories (NO RELATIVE PATHS!).
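The "no relative paths" rule is easy to break by accident, so a quick scan of the job configuration can help. The sketch below is a heuristic only: `find_relative_paths` is a hypothetical helper, and it assumes that paths appear in the file as double-quoted values. It flags any quoted value that contains a `/` but neither starts with `/` nor carries a `scheme://` prefix.

```shell
# Print (with line numbers) every line of a config file whose quoted value
# looks like a relative path. Returns 0 if any such line is found, 1 otherwise.
# Heuristic: a quoted string whose first segment has no '/', ':' or '"'
# and is followed by '/', for example "some/dir" but not "/some/dir".
find_relative_paths() {
    grep -nE '"[^"/:]+/[^"]*"' "$1"
}
```

For example, `find_relative_paths job.conf && echo "job.conf still contains relative paths"` prints the offending lines followed by the warning, and stays silent when nothing is flagged.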

Start the job:

If starting for the first time:

    ./scheduler <job config file> -scheduler_conf <scheduler config file> -singa_conf <SINGA config file>

If not the first time:
```
    ./scheduler <job config file>
```

Note: each running job is given a frameworkID. Look for a log message of the form:

         Framework registered with XXX-XXX-XXX-XXX-XXX-XXX
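Since the frameworkID is needed later for debugging and for stopping the job, it can be convenient to pull it out of the scheduler output automatically. This is a minimal sketch that relies only on the message form shown above; `extract_framework_id` is a hypothetical helper and is not provided by SINGA or Mesos.

```shell
# Read scheduler log output on stdin and print the first frameworkID found.
# Relies on the message form "Framework registered with <frameworkID>".
extract_framework_id() {
    sed -n 's/.*Framework registered with \([^[:space:]]*\).*/\1/p' | head -n 1
}
```

One possible use, assuming the scheduler output was saved to a file (here called scheduler.log, a name chosen for illustration), is `extract_framework_id < scheduler.log`.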

Monitoring and Debugging

Each Mesos job is given a frameworkID, and a sandbox directory is created for each job. The directory is located under the specified work_dir (/tmp/mesos by default). For example, errors produced during SINGA execution can be found in:

        /tmp/mesos/slaves/xxxxx-Sx/frameworks/xxxxx/executors/SINGA_x/runs/latest/stderr

Other artifacts, such as files downloaded from HDFS (job.conf) and stdout, can be found in the same directory.
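Because the slave and executor parts of the sandbox path vary per run, a glob saves typing them out. The sketch below follows the directory layout shown above; `sandbox_stderr_files` is a hypothetical helper, and it must be run on the node that hosts the sandbox.

```shell
# List the stderr files of the latest run of every executor of one framework.
# $1: frameworkID; $2: Mesos work_dir (optional, defaults to /tmp/mesos).
sandbox_stderr_files() {
    work_dir="${2:-/tmp/mesos}"
    for f in "$work_dir"/slaves/*/frameworks/"$1"/executors/*/runs/latest/stderr; do
        # An unmatched glob stays literal, so only print paths that exist.
        [ -e "$f" ] && printf '%s\n' "$f"
    done
    return 0
}
```

For instance, `sandbox_stderr_files <frameworkID> | xargs tail -n 20` would show the last lines of each error log found on the current node.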

Stopping

There are two ways to kill a running job:

If the scheduler is running in the foreground, simply kill it (using Ctrl-C, for example).

If the scheduler is running in the background, kill it using Mesos's REST API:

   curl -d "frameworkId=XXX-XXX-XXX-XXX-XXX-XXX" -X POST http://<master>/master/shutdown
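The same request can be wrapped in a small function so that the master address and frameworkID are passed as arguments. This is only a convenience sketch around the curl call above; `shutdown_framework` is a hypothetical name, and the master address is assumed to be given as host:port (for example node0:5050, as used when starting the slaves).

```shell
# Ask the Mesos master to shut down a framework.
# $1: master address as host:port; $2: frameworkID of the job to kill.
shutdown_framework() {
    master="$1"
    framework_id="$2"
    if [ -z "$master" ] || [ -z "$framework_id" ]; then
        echo "usage: shutdown_framework <master host:port> <frameworkID>" >&2
        return 2
    fi
    # Same request as the curl command in this section, with the values filled in.
    curl -d "frameworkId=${framework_id}" -X POST "http://${master}/master/shutdown"
}
```

Checking the arguments first avoids sending a request with an empty frameworkID by mistake.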