
 # Training on GPU

 ---

 GPUs are much faster than CPUs at linear algebra, and training deep learning
 models is dominated by linear algebra operations, so supporting training on GPU
 cards is essential.
 SINGA currently supports training on a single node (i.e., process) with multiple GPU
 cards. Training on a GPU cluster with multiple nodes is under development.

 ## Instructions

 ### Compilation
 To enable training on GPU, compile SINGA with [CUDA](http://www.nvidia.com/object/cuda_home_new.html) from NVIDIA:

     ./configure --enable-cuda --with-cuda=<path to cuda folder>

 In addition, to use NVIDIA's [cuDNN library](https://developer.nvidia.com/cudnn) for convolutional neural networks,
 enable cuDNN as well:


     ./configure --enable-cuda --with-cuda=<path to cuda folder> --enable-cudnn --with-cudnn=<path to cudnn folder>

 SINGA currently supports cuDNN v3 and v4.


 ### Configuration

 The job configuration for GPU training is similar to that for CPU training.
 The one additional field is `gpu`, which indicates the device ID of
 the GPU you want to use. The simplest configuration is


     # job.conf
     ...
     gpu: 0
     ...


 This configuration runs the worker on GPU 0.

 #### Single node with multiple GPUs
 To launch multiple workers, each on a separate GPU, configure the job as

     # job.conf
     ...
     gpu: 0
     gpu: 2
     ...
     cluster {
       nworkers_per_group: 2
       nworkers_per_process: 2
     }

 With the above configuration, SINGA partitions each mini-batch evenly
 across two workers, which run on GPU 0 and GPU 2 respectively. For more information
 on running multiple workers in a single node, please refer to
 [Training Framework](frameworks.html). Make sure the number of `gpu` entries equals
 the number of workers. Otherwise, some workers run on GPU and the
 rest run on CPU; this kind of hybrid training is not well supported for now.
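
The even split and the worker-to-GPU pairing described above can be illustrated with a small Python sketch. It is not SINGA code: `partition_batch`, `assign_devices`, and `check_config` are hypothetical helpers, the use of `-1` to mean "CPU" is an assumption made for the illustration, and the handling of batch sizes that do not divide evenly is a guess rather than documented SINGA behavior.

```python
CPU = -1  # assumed marker for "no GPU assigned"


def partition_batch(batch_size, num_workers):
    """Split batch_size samples into num_workers contiguous, near-even ranges."""
    base, extra = divmod(batch_size, num_workers)
    ranges = []
    start = 0
    for w in range(num_workers):
        # The first `extra` workers take one additional sample (assumption).
        size = base + (1 if w < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


def assign_devices(gpu_ids, num_workers):
    """Pair worker i with the i-th listed GPU; leftover workers fall back to CPU."""
    return [gpu_ids[w] if w < len(gpu_ids) else CPU for w in range(num_workers)]


def check_config(gpu_ids, num_workers):
    """Reject the hybrid GPU/CPU setup that the text warns about."""
    if len(gpu_ids) != num_workers:
        raise ValueError(
            f"{len(gpu_ids)} gpu entries for {num_workers} workers"
        )


if __name__ == "__main__":
    gpus = [0, 2]              # mirrors `gpu: 0` and `gpu: 2` in job.conf
    check_config(gpus, 2)      # passes: two GPUs, two workers
    print(partition_batch(64, 2))
    print(assign_devices(gpus, 2))
    print(assign_devices([0], 2))  # one worker would end up on CPU
```

The last call shows the mismatch case: with a single `gpu` entry and two workers, the second worker has no GPU to run on, which is the hybrid situation to avoid.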


 Some layers, such as InnerProductLayer, GRULayer, and ReLULayer, have implementations that are
 transparent to GPU/CPU, so the same configuration works for these layers on
 either device. For other layers, especially those used in ConvNets, SINGA
 uses different implementations for GPU and CPU; in particular, the GPU version is
 implemented with the cuDNN library. To train a ConvNet on GPU, configure the layers as

     layer {
       type: kCudnnConv
       ...
     }
     layer {
       type: kCudnnPool
       ...
     }

 The [cifar10 example](cnn.html) and [Alexnet example](alexnet.html) have complete
 configurations for ConvNet.

 #### GPU cluster
 For distributed training over a (GPU) cluster, configure SINGA with
 `--enable-dist`, which compiles SINGA with ZooKeeper and ZeroMQ.

 ## Implementation details

 SINGA implements GPU training by having the Driver class assign each worker a GPU device at the beginning
 of training. The worker can then call GPU functions, which run on the
 assigned GPU. The GPU is typically used for the linear algebra computation in layer
 functions, which is what GPUs are good at. A `Context` singleton
 stores the handles and random generators for each device. Layer code
 should detect the device it is running on and call the CPU or GPU functions accordingly.
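
The idea of a per-process singleton that remembers which device each worker was given can be sketched in Python as follows. This is an illustration only, not SINGA's C++ `Context` class: the method names `instance`, `setup_device`, and `device_id`, the `-1` CPU marker, and the `launch` helper standing in for the Driver are all assumptions, and the device handles and random generators mentioned above are left out.

```python
import threading

CPU = -1  # assumed marker for "no GPU assigned"


class Context:
    """Hypothetical per-process registry mapping worker threads to devices."""

    _instance = None
    _instance_lock = threading.Lock()

    def __init__(self):
        self._device_of = {}          # thread id -> device id
        self._lock = threading.Lock()

    @classmethod
    def instance(cls):
        # Create the single shared object on first use.
        with cls._instance_lock:
            if cls._instance is None:
                cls._instance = cls()
            return cls._instance

    def setup_device(self, thread_id, device_id):
        with self._lock:
            self._device_of[thread_id] = device_id

    def device_id(self, thread_id):
        # Threads that were never registered are treated as CPU workers.
        with self._lock:
            return self._device_of.get(thread_id, CPU)


def run_worker(device_id, results, index):
    ctx = Context.instance()
    # Done once, at the beginning of training.
    ctx.setup_device(threading.get_ident(), device_id)
    # Later, layer code looks the device up again to pick CPU or GPU functions.
    results[index] = ctx.device_id(threading.get_ident())


def launch(gpu_ids):
    """Stand-in for the Driver: start one worker thread per listed GPU."""
    results = [None] * len(gpu_ids)
    threads = [
        threading.Thread(target=run_worker, args=(gpu, results, i))
        for i, gpu in enumerate(gpu_ids)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Calling `launch([0, 2])` starts two workers, and each one reads back the device it was assigned without that device being passed through the layer code.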

 To make layer implementation easier,
 SINGA provides some linear algebra functions (in *math_blob.h*) that are transparent to the running
 device. Internally, they query the Context singleton for the device information
 and call the CPU or GPU implementation to do the computation. Consequently, users can implement
 layers without being aware of the underlying device.
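
A device-transparent function of this kind might look like the Python sketch below. The names `axpy`, `set_device`, `current_device`, and `sgd_step` are hypothetical and are not taken from *math_blob.h*; a thread-local variable stands in for the Context lookup, and the "GPU" branch is a placeholder that computes on the CPU so the example stays runnable without a GPU.

```python
import threading

CPU = -1                    # assumed marker for "no GPU assigned"
_local = threading.local()  # stand-in for the Context singleton lookup
calls = []                  # records which backend handled each call


def set_device(device_id):
    _local.device = device_id


def current_device():
    return getattr(_local, "device", CPU)


def _cpu_axpy(alpha, x, y):
    calls.append("cpu")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def _gpu_axpy(alpha, x, y):
    # Placeholder: a real implementation would run this on the GPU.
    calls.append("gpu")
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def axpy(alpha, x, y):
    """Compute alpha * x + y on whichever device the calling worker owns."""
    if current_device() == CPU:
        return _cpu_axpy(alpha, x, y)
    return _gpu_axpy(alpha, x, y)


def sgd_step(weights, grads, lr):
    # Layer-style code: no mention of CPU or GPU anywhere.
    return axpy(-lr, grads, weights)


if __name__ == "__main__":
    set_device(0)  # pretend the Driver assigned GPU 0 to this worker
    print(sgd_step([3.0, 5.0], [2.0, 4.0], 0.5), calls[-1])
```

The point of the design is that `sgd_step` is written once; only `axpy` knows that two backends exist.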

 If the functionality cannot be implemented with the SINGA-provided functions in
 *math_blob.h*, the layer code needs to handle the CPU and GPU devices explicitly
 by querying the Context singleton. Layers that cannot run on GPU, e.g.,
 input/output layers and connection layers, which do little computation but much
 IO or network work, do not need to consider the GPU device.
 When these layers are configured in a neural net, they run on CPU (since
 they don't call GPU functions).