# Quick Start
---
## SINGA setup
Please refer to the [installation](installation.html) page for guidance on installing SINGA.
### Training on a single node
For single-node training, one process is launched to run SINGA on the
local host. As an example, we train the [CNN model](http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks) over the
[CIFAR-10](http://www.cs.toronto.edu/~kriz/cifar.html) dataset.
The hyper-parameters are set following
[cuda-convnet](https://code.google.com/p/cuda-convnet/). More details are
available at the [CNN example](cnn.html) page.
#### Preparing data and job configuration
Download the dataset and create the data shards for training and testing.
```
cd examples/cifar10/
cp Makefile.example Makefile
make download
make create
```
A training dataset and a test dataset are created. An *image_mean.bin* file is also
generated, which contains the feature mean of all images.
Since all code used to train this CNN model is provided by SINGA as a
built-in implementation, there is no need to write any code. Users simply
execute the running script, providing the job
configuration file (*job.conf*). To write your own code with SINGA, please refer to the
[programming guide](programming-guide.html).
#### Training without parallelism
By default, the cluster topology has a single worker and a single server.
In other words, neither the training data nor the neural net is partitioned.
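For reference, this implicit default roughly corresponds to the sketch below; the server-related field names are assumptions modelled on the worker fields used later in this guide, and the block is not required in a real *job.conf*.
```
# job.conf (sketch of the implicit default; not required in practice)
...
cluster {
  nworker_groups: 1       # one worker group
  nworkers_per_group: 1   # with a single worker
  nserver_groups: 1       # assumed field name, mirroring the worker fields
  nservers_per_group: 1   # assumed field name
  workspace: "examples/cifar10/"
}
```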
The training is started by running:
```
# goto top level folder
cd ../../
./singa -conf examples/cifar10/job.conf
```
#### Asynchronous parallel training
```
# job.conf
...
cluster {
  nworker_groups: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}
```
In SINGA, [asynchronous training](architecture.html) is enabled by launching
multiple worker groups. For example, we can change the original *job.conf* to
have two worker groups as shown above. By default, each worker group has one
worker. Since one process is configured to contain two workers, the two worker
groups run in the same process. Consequently, they run the in-memory
[Downpour](frameworks.html) training framework. Users do not need to split the
dataset explicitly for each worker (group); instead, they can assign each
worker (group) a random offset to the start of the dataset via the `random_skip`
field, so that the workers run as if on different data partitions:
```
# job.conf
...
neuralnet {
  layer {
    ...
    store_conf {
      random_skip: 5000
    }
  }
  ...
}
```
The running command is:
```
./singa -conf examples/cifar10/job.conf
```
#### Synchronous parallel training
```
# job.conf
...
cluster {
  nworkers_per_group: 2
  nworkers_per_procs: 2
  workspace: "examples/cifar10/"
}
```
In SINGA, [synchronous training](architecture.html) is enabled
by launching multiple workers within one worker group. For instance, we can
change the original *job.conf* to have two workers in one worker group as shown
above. The workers run synchronously
since they belong to the same worker group. This is the in-memory
[sandblaster](frameworks.html) framework.
The model is partitioned among the two workers. Specifically, each layer is
sliced over the two workers. The sliced layer
is the same as the original layer except that it only has `B/g` feature
instances, where `B` is the number of instances in a mini-batch and `g` is the number of
workers in a group. It is also possible to partition the layer (or neural net)
using [other schemes](neural-net.html).
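As an illustration only, the default slicing described above might be written as in the sketch below; the `partition_dim` field and its meaning are assumptions taken from the [neural net](neural-net.html) page and are not required for this example.
```
# job.conf (illustrative sketch; partition_dim is an assumed field)
...
neuralnet {
  partition_dim: 0   # assumed: slice each layer so every worker gets B/g instances
  layer {
    ...
  }
}
```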
All other settings are the same as running without partitioning:
```
./singa -conf examples/cifar10/job.conf
```
### Training in a cluster
#### Starting Zookeeper
SINGA uses [zookeeper](https://zookeeper.apache.org/) to coordinate the
training, and uses ZeroMQ for transferring messages. After installing zookeeper
and ZeroMQ, you need to configure SINGA with `--enable-dist` before compiling.
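A minimal sketch of that build step, assuming the configure-based workflow from the [installation](installation.html) page (your exact commands may differ):
```
# rebuild SINGA with distributed training support (sketch)
./configure --enable-dist
make
```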
Please make sure the zookeeper service is started before running SINGA.
If you installed zookeeper using our thirdparty script, you can
simply start it by:
```
# goto top level folder
cd SINGA_ROOT
./bin/zk-service.sh start
```
(`./bin/zk-service.sh stop` stops the zookeeper service.)
Otherwise, if you launched zookeeper yourself and did not use the
default port, please edit `conf/singa.conf`:
```
zookeeper_host: "localhost:YOUR_PORT"
```
We can extend the above two training frameworks to a cluster by updating the
cluster configuration with:
```
nworkers_per_procs: 1
```
Every process then creates only one worker thread. Consequently, the workers
are created in different processes (i.e., nodes). A *hostfile*
must be provided under *SINGA_ROOT/conf/* listing the nodes in the cluster,
e.g.,
```
192.168.0.1
192.168.0.2
```
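For instance, stretching the asynchronous (Downpour) framework above across the two nodes could use a cluster section like the sketch below, which simply combines fields already shown in this guide:
```
# job.conf (sketch; combines the fields shown earlier)
...
cluster {
  nworker_groups: 2       # two worker groups, as in the asynchronous example
  nworkers_per_procs: 1   # one worker per process, so the groups run on different nodes
  workspace: "examples/cifar10/"
}
```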
The zookeeper location must also be configured correctly, e.g.,
```
# conf/singa.conf
zookeeper_host: "logbase-a01"
```
The running command is:
```
./bin/singa-run.sh -conf examples/cifar10/job.conf
```
You can list the currently running jobs by:
```
./bin/singa-console.sh list

JOB ID    |NUM PROCS
----------|-----------
24        |2
```
Jobs can be killed by:
```
./bin/singa-console.sh kill JOB_ID
```
Logs and job information are stored in the */tmp/singa-log* folder, which can be
changed by setting `log-dir` in *conf/singa.conf*.
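For example, a sketch of that setting, following the key/value style of *conf/singa.conf* shown above (the path is just an example):
```
# conf/singa.conf (sketch; the path is an example)
log-dir: "/var/log/singa"
```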
### Training with GPUs
Please refer to the [GPU page](gpu.html) for details on training using GPUs.
## Where to go next
The [programming guide](programming-guide.html) pages will
describe how to submit a training job in SINGA.