| # Quick Start |
| |
| --- |
| |
| ## SINGA setup |
| |
| Please refer to the [installation](installation.html) page for guidance on installing SINGA. |
| |
| |
| ### Training on a single node |
| |
For single-node training, one process is launched to run SINGA on the
local host. As an example, we train the [CNN model](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) over the
[CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html) dataset.
The hyper-parameters are set following
[cuda-convnet](https://code.google.com/p/cuda-convnet/). More details are
available in the [CNN example](cnn.html).
| |
| |
| #### Preparing data and job configuration |
| |
| Download the dataset and create the data shards for training and testing. |
| |
    cd examples/cifar10/
    cp Makefile.example Makefile
    make download    # fetch the CIFAR-10 dataset
    make create      # create the training and test data shards
| |
A training dataset and a test dataset are created. An *image_mean.bin* file is also
generated, which contains the feature mean of all the images.
| |
Since all the code used to train this CNN model is provided by SINGA as
built-in implementations, there is no need to write any code. Users simply
execute the running script with the job
configuration file (*job.conf*). To write your own code for SINGA, please refer to the
[programming guide](programming-guide.html).
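
As a rough sketch, a *job.conf* typically contains the sections below. The
field names follow the [CNN example](cnn.html); the values here are
illustrative assumptions, not a complete configuration:

    # job.conf (illustrative skeleton)
    name: "cifar10-convnet"   # job name (assumed value)
    train_steps: 1000         # number of training iterations (assumed value)
    neuralnet {
      ...                     # layer definitions
    }
    updater {
      ...                     # learning rate, momentum, etc.
    }
    cluster {
      ...                     # topology; see the examples below
    }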
| |
| #### Training without parallelism |
| |
| By default, the cluster topology has a single worker and a single server. |
| In other words, neither the training data nor the neural net is partitioned. |
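
A minimal sketch of this default topology, written in the same style as the
examples below (the `nserver_groups` field name is an assumption based on the
worker-side fields):

    # job.conf
    ...
    cluster {
      nworker_groups: 1   # a single worker group with one worker
      nserver_groups: 1   # a single server group (field name assumed)
      workspace: "examples/cifar10/"
    }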
| |
| The training is started by running: |
| |
| # goto top level folder |
| cd ../../ |
| ./singa -conf examples/cifar10/job.conf |
| |
| #### Asynchronous parallel training |
| |
| # job.conf |
| ... |
| cluster { |
      nworker_groups: 2       # two worker groups train asynchronously
      nworkers_per_procs: 2   # both workers run as threads in one process
| workspace: "examples/cifar10/" |
| } |
| |
In SINGA, [asynchronous training](architecture.html) is enabled by launching
multiple worker groups. For example, we can change the original *job.conf* to
have two worker groups as shown above. By default, each worker group has one
worker. Since one process is configured to contain two workers, the two worker
groups run in the same process. Consequently, they follow the in-memory
[Downpour](frameworks.html) training framework. Users do not need to split the
dataset explicitly for each worker (group); instead, each worker (group) can be
assigned a random offset to the start of the dataset, so the workers run as if
on different data partitions.
| |
| # job.conf |
| ... |
| neuralnet { |
| layer { |
| ... |
| store_conf { |
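          # each worker skips a random number of records (at most 5000) before its first read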
| random_skip: 5000 |
| } |
| } |
| ... |
| } |
| |
| The running command is: |
| |
| ./singa -conf examples/cifar10/job.conf |
| |
| #### Synchronous parallel training |
| |
| # job.conf |
| ... |
| cluster { |
      nworkers_per_group: 2   # two workers in one group run synchronously
      nworkers_per_procs: 2   # both workers run as threads in one process
| workspace: "examples/cifar10/" |
| } |
| |
In SINGA, [synchronous training](architecture.html) is enabled
by launching multiple workers within one worker group. For instance, we can
change the original *job.conf* to have two workers in one worker group as shown
above. The workers run synchronously
because they belong to the same worker group. This framework is the in-memory
[Sandblaster](frameworks.html).
The model is partitioned between the two workers. Specifically, each layer is
sliced over the two workers. Each sliced layer
is the same as the original layer except that it only has `B/g` feature
instances, where `B` is the number of instances in a mini-batch and `g` is the
number of workers in a group. For example, with `B = 100` and `g = 2`, each
worker processes 50 instances per mini-batch. It is also possible to partition
the layer (or the neural net) using [other schemes](neural-net.html).
All other settings are the same as when running without partitioning.
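
As a sketch only: per-layer slicing can also be requested explicitly with the
`partition_dim` field described on the [neural net](neural-net.html) page; its
exact placement and semantics here are assumptions:

    # job.conf
    ...
    neuralnet {
      partition_dim: 0  # assumption: 0 slices each layer on the batch dimension
      layer {
        ...
      }
    }

The running command is unchanged: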
| |
| ./singa -conf examples/cifar10/job.conf |
| |
| |
| ### Training in a cluster |
| |
| #### Starting Zookeeper |
| |
SINGA uses [zookeeper](https://zookeeper.apache.org/) to coordinate the
training and ZeroMQ to transfer messages. After installing zookeeper
and ZeroMQ, you need to configure SINGA with `--enable-dist` before compiling.
Please make sure the zookeeper service is started before running SINGA.
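
For example, assuming the standard build flow from the
[installation](installation.html) page:

    # rebuild SINGA with distributed training enabled
    ./configure --enable-dist
    make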
| |
If you installed zookeeper using our thirdparty script, you can
simply start it by:

    # goto top level folder
    cd SINGA_ROOT
    ./bin/zk-service.sh start
| |
| (`./bin/zk-service.sh stop` stops the zookeeper). |
| |
Otherwise, if you launched a zookeeper instance yourself and did not use the
default port, please edit `conf/singa.conf`:
| |
| zookeeper_host: "localhost:YOUR_PORT" |
| |
| |
We can extend the above two training frameworks to a cluster by updating the
cluster configuration with:

    nworkers_per_procs: 1

Every process then creates only one worker thread, so the workers
are created in different processes (i.e., nodes). A *hostfile*
listing the nodes in the cluster must be provided under *SINGA_ROOT/conf/*,
e.g.,
| |
| 192.168.0.1 |
| 192.168.0.2 |
| |
| And the zookeeper location must be configured correctly, e.g., |
| |
    # conf/singa.conf
| zookeeper_host: "logbase-a01" |
| |
The running command is:
| |
| ./bin/singa-run.sh -conf examples/cifar10/job.conf |
| |
You can list the currently running jobs by:
| |
| ./bin/singa-console.sh list |
| |
| JOB ID |NUM PROCS |
| ----------|----------- |
| 24 |2 |
| |
Jobs can be killed by:
| |
| ./bin/singa-console.sh kill JOB_ID |
| |
| |
Logs and job information are available in the */tmp/singa-log* folder, which can be
changed to another folder by setting `log-dir` in *conf/singa.conf*.
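
For instance (the path below is illustrative, not a default):

    # conf/singa.conf
    log-dir: "/var/log/singa"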
| |
| ### Training with GPUs |
Please refer to the [GPU page](gpu.html) for details on training using GPUs.
| |
| ## Where to go next |
| |
The [programming guide](programming-guide.html) pages describe how to
write and submit your own training jobs in SINGA.