#### List of Resources
Suppose that we are implementing a multi-GPU learning program. One way to do that is through data parallelism: we launch many threads, each thread computes a gradient on one GPU, and the statistics are aggregated together. However, the gradient synchronization step can be time consuming, and in many cases we can compute in a smarter way, so that the computation overlaps with the synchronization.
mshadow-ps provides an interface to perform such synchronization in an easy way. The following sections walk through how to use it.
We first get familiar with the mshadow-ps interface through the following program in `local_sum-inl.h`. You can compile the program by setting up `config.mk` according to your computer's environment and typing `make`.
In the following program, each thread first does some computation locally, then tries to get the sum of the data through the mshadow-ps interface. There are four key functions in the `ISharedModel` interface used below: `InitKey`, `Push`, `PullReq`, and `PullWait`.
```c++
// this function is run by a specific thread
template<typename xpu>
inline void RunWorkerThread(int devid,
                            mshadow::ps::ISharedModel<xpu, float> *ps) {
  // initialize tensor engine
  mshadow::InitTensorEngine<xpu>(devid);
  mshadow::Stream<xpu> *stream = mshadow::NewStream<xpu>();
  // allocate tensor on xpu
  mshadow::TensorContainer<xpu, 2> data(mshadow::Shape2(2, 3));
  // set the computation stream to the newly allocated stream
  // this makes subsequent computations whose target is data use the stream;
  // a stream is needed for async execution on GPU
  data.set_stream(stream);
  // assume these operations set the content of data
  data[0] = 1.0f;
  data[1] = devid + data[0];
  printf("dev%d: before sync, data:\n", devid);
  // use Print to show the result; do not call Print routinely,
  // since the Copy inside it will block
  Print(data);
  printf("====================\n");
  // initialize the keys, register the shapes on the parameter server
  ps->InitKey(data[0].shape_, 0, devid);
  ps->InitKey(data[1].shape_, 1, devid);
  // push data[0] out, for update or aggregation
  // 0 is the key of the data, devid is the current device id
  ps->Push(data[0], 0, devid);
  // a pull request is used to request the data to be copied back
  // once the computation is done
  ps->PullReq(data[0], 0, devid);
  // computation can be done here..
  // the pull request handler will be overlapped with the computation
  // similar to the previous calls
  ps->Push(data[1], 1, devid);
  ps->PullReq(data[1], 1, devid);
  // more computation can be done here...
  // the computation will be overlapped
  // PullWait will block until these requests finish
  ps->PullWait(0, devid);
  ps->PullWait(1, devid);
  printf("dev%d: after sync, data:\n", devid);
  Print(data);
  printf("====================\n");
  mshadow::DeleteStream(stream);
  mshadow::ShutdownTensorEngine<xpu>();
}

template<typename xpu>
inline int Run(int argc, char *argv[]) {
  if (argc < 2) {
    printf("Usage: device list\n"\
           "\tfor CPU the device list can be arbitrary\n"\
           "\tfor GPU the device list needs to be the actual device index\n");
    return 0;
  }
  // list of device ids
  std::vector<int> devs;
  // initialization
  for (int i = 1; i < argc; ++i) {
    // record the device id
    devs.push_back(atoi(argv[i]));
  }
  mshadow::ps::ISharedModel<xpu, float> *ps =
      mshadow::ps::CreateSharedModel<xpu, float>("local");
  // initialize the ps
  ps->Init(devs);
  // use openmp to launch #devs threads
  #pragma omp parallel num_threads(devs.size())
  {
    int tid = omp_get_thread_num();
    RunWorkerThread<xpu>(devs[tid], ps);
  }
  delete ps;
  return 0;
}
```
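For completeness, a minimal driver is sketched below. It assumes the code above is saved as `local_sum-inl.h` and builds a CPU binary; a GPU build would call `Run<mshadow::gpu>` instead.

```c++
// minimal driver sketch (assumes the excerpt above lives in local_sum-inl.h,
// which also provides the Print helper and the required includes)
#include "./local_sum-inl.h"

int main(int argc, char *argv[]) {
  // e.g. ./local_sum.cpu 0 1 runs two worker threads on CPU
  return Run<mshadow::cpu>(argc, argv);
}
```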
In the above example, we did not do a weight update on the server side, so the synchronization result is simply the sum of the data on each device. The key property of this interface is that `Push` and `PullReq` are asynchronous: they return immediately after being called, so the computation that follows can overlap with the synchronization, and `PullWait` blocks only when the pulled result is actually needed.
To get a more concrete understanding of the interface, we give an example of a multi-GPU two-layer neural net in ../neuralnet/nnet_ps.cu. The general idea is as follows: each thread runs one copy of the network on its own device; the gradient of each layer is pushed out as soon as backprop for that layer finishes, and the aggregated gradient is pulled back before the corresponding weight update, so the synchronization overlaps with the rest of the computation.
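To make the overlap concrete, here is a minimal sketch of that per-layer pattern, written in the style of the excerpt above. It is not the actual `nnet_ps.cu` code; the gradient names `gW1`/`gW2` and the key assignment are made up for illustration, and the keys are assumed to have been registered with `InitKey` during setup.

```c++
// sketch only: synchronize the gradients of a two-layer net while
// backprop and the weight updates continue on the device
template<typename xpu>
inline void SyncTwoLayerGradients(mshadow::ps::ISharedModel<xpu, float> *ps,
                                  mshadow::TensorContainer<xpu, 2> &gW1,
                                  mshadow::TensorContainer<xpu, 2> &gW2,
                                  int devid) {
  // backprop of the top layer finishes first: push its gradient right away
  // and request the aggregated result to be copied back
  ps->Push(gW2, 1, devid);
  ps->PullReq(gW2, 1, devid);
  // ... backprop of the first layer runs here, overlapping with the
  //     synchronization of gW2 ...
  ps->Push(gW1, 0, devid);
  ps->PullReq(gW1, 0, devid);
  // wait for each aggregated gradient just before it is needed
  ps->PullWait(1, devid);
  // ... update the top-layer weight locally using the aggregated gW2 ...
  ps->PullWait(0, devid);
  // ... update the first-layer weight locally using the aggregated gW1 ...
}
```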
There are a few things to note about the use of mshadow-ps in the neural net code.
Because the example runs on MNIST, which is a rather small dataset, you may not observe a speedup with multiple cards. However, you will see a significant speedup on larger tasks. The newest version of cxxnet uses mshadow-ps for its multi-GPU and distributed training.
In all the examples so far, we use mshadow-ps to get the aggregated sum of gradients and update the weights locally on each GPU. For more advanced usage of mshadow-ps, we can move the weight update to the server side. The communication pattern is then as follows: each device pushes out its local gradient; the gradients are aggregated and passed to the server, which updates the weights; the updated weights are then pulled back to each device.
Such an update pattern is suitable for the distributed setting. To use it, the user needs to implement an `IModelUpdater` interface and define the following `CreateModelUpdater` function in the program:
```c++
namespace mshadow {
namespace ps {
template<>
IModelUpdater<float> *CreateModelUpdater() {
  return new MyModelUpdater();
}
}  // namespace ps
}  // namespace mshadow
```
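The virtual methods that `MyModelUpdater` must override are declared in mshadow-ps itself and are not reproduced here, so the class below is only a structural sketch of where the server-side update logic goes, not the verified API.

```c++
// structural sketch only: derive from IModelUpdater<float> and implement
// the initialization/update hooks declared by mshadow-ps, e.g. keep one
// weight tensor per key and apply an SGD step whenever the aggregated
// gradient for that key arrives at the server
class MyModelUpdater : public mshadow::ps::IModelUpdater<float> {
  // override the IModelUpdater hooks here
};
```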
Before calling `ISharedModel::Init`, the user needs to call `ps->SetParam("update_on_server", "1")` to set the update mode to run on the server side. If the distributed shared model is used, the user must define a `ModelUpdater`.
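For example, a sketch of the required call order (using the local model from the earlier example; `devs` is the device list passed to `Init`):

```c++
// sketch: the update mode must be set before Init is called
mshadow::ps::ISharedModel<xpu, float> *ps =
    mshadow::ps::CreateSharedModel<xpu, float>("local");
ps->SetParam("update_on_server", "1");
ps->Init(devs);
```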
First, build the parameter server (replace `ps_dir` with any convenient directory):
```bash
git clone https://github.com/dmlc/parameter_server -b dev ps_dir
cd ps_dir
./script/install_third.sh
make -j8
```
Next, change `config.mk` to:
```
USE_DIST_PS = 1
PS_PATH = ps_dir
```
Then run `make`.
Next, start 1 server node and 3 worker nodes, with 2 devices on each worker node:
```bash
./local.sh 1 3 ./dist_async_sum.cpu 1 2
```
`dist_async_sum-inl.h` is similar to `local_sum-inl.h`. The main differences are that 1) the shared model is created with the "dist" type, so the server runs on a remote node, and 2) `update_on_server` is set to true:
```c++
auto *ps = mshadow::ps::CreateSharedModel<xpu, float>("dist");
ps->SetParam("update_on_server", "1");
```
The corresponding driver program is `dist_async_sum.cpp`.