| # Training on GPU |
| |
| --- |
| |
| Considering GPU is much faster than CPU for linear algebra operations, |
| it is essential to support the training of deep learning models (which involves |
| a lot of linear algebra operations) on GPU cards. |
| SINGA now supports training on a single node (i.e., process) with multiple GPU |
| cards. Training in a GPU cluster with multiple nodes is under development. |
| |
| ## Instructions |
| |
| ### Compilation |
| To enable the training on GPU, you need to compile SINGA with [CUDA](http://www.nvidia.com/object/cuda_home_new.html) from Nvidia, |
| |
| ./configure --enable-cuda --with-cuda=<path to cuda folder> |
| |
| In addition, if you want to use the [CUDNN library](https://developer.nvidia.com/cudnn) for convolutional neural network |
| provided by Nvidia, you need to enable CUDNN, |
| |
| |
| ./configure --enable-cuda --with-cuda=<path to cuda folder> --enable-cudnn --with-cudnn=<path to cudnn folder> |
| |
| SINGA now supports CUDNN V3 and V4. |
| |
| |
| ### Configuration |
| |
| The job configuration for GPU training is similar to that for training on CPU. |
| There is one more field to configure, `gpu`, which indicate the device ID of |
| the GPU you want to use. The simplest configuration is |
| |
| |
| # job.conf |
| ... |
| gpu: 0 |
| ... |
| |
| |
| #### Single node with multiple GPUs |
| This configuration will run the worker on GPU 0. If you want to launch multiple |
| workers, each on a separate GPU, you can configure it as |
| |
| # job.conf |
| ... |
| gpu: 0 |
| gpu: 2 |
| ... |
| cluster { |
| nworkers_per_group: 2 |
| nworkers_per_process: 2 |
| } |
| |
| Using the above configuration, SINGA would partition each mini-batch evenly |
| onto two workers which run on GPU 0 and GPU 2 respectively. For more information |
| on running multiple workers in a single node, please refer to |
| [Training Framework](frameworks.html). Please be careful to configure the same number |
| of workers and number of `gpu`s. Otherwise some workers would run on GPU and the |
| rest would run on CPU. This kind of hybrid training is not well supported for now. |
| |
| |
| For some layers, their implementation is transparent to GPU/CPU, like the InnerProductLayer |
| GRULayer, ReLULayer, etc. Hence, you can use the same configuration for these layers to run |
| on GPU or CPU. For other layers, especially the layers involved in ConvNet, SINGA |
| uses different implementations for GPU and CPU. Particularly, the GPU version is |
| implemented using CUDNN library. To train a ConvNet on GPU, you configure the layers as |
| |
| layer { |
| type: kCudnnConv |
| ... |
| } |
| layer { |
| type: kCudnnPool |
| ... |
| } |
| |
| The [cifar10 example](cnn.html) and [Alexnet example](alexnet.html) have complete |
| configurations for ConvNet. |
| |
| #### GPU cluster |
| For distributed training over a (GPU) cluster, you just need to configure SINGA with |
| `--enable-dist`, which would then compile SINGA with zookeeper and ZeroMQ. |
| |
| ## Implementation details |
| |
| SINGA implements the GPU training by assigning each worker a GPU device at the beginning |
| of training (by the Driver class). Then the work can call GPU functions and run them on the |
| assigned GPU. GPU is typically used for linear algebra computation in layer |
| functions, because GPU is good at such computation. There is a [Context]() singleton, |
| which stores the handles and random generators for each device. The layer code |
| should detect its running device and then call the CPU or GPU functions correspondingly. |
| |
| To make the layer implementation easier |
| SINGA provides some linear algebra functions (in *math_blob.h*), which are transparent to the running |
| device for users. Internally, they query the Context singleton to get the device information |
| and call CPU or GPU to do the computation. Consequently, users can implement |
| layers without awareness of the underlying running device. |
| |
| If the functionality cannot be implemented using SINGA provided functions in |
| *math_blob.h*, the layer code needs to handle the CPU and GPU devices explicitly |
| by querying the Context singleton. For layers that cannot run on GPU, e.g., |
| input/output layers and connection layers which have little computation but much |
| IO or network workload, there is no need to consider the GPU device. |
| When these layers are configured in a neural net, they will run on CPU (since |
| they don't call GPU functions). |
| |