devdocs/gpu-backend.md - systemds - Git at Google

 <!--
 {% comment %}
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to you under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

 http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
 {% endcomment %}
 -->

 # Initial prototype for GPU backend

 The GPU backend implements two important abstract classes:
 1. `org.apache.sysml.runtime.controlprogram.context.GPUContext`
 2. `org.apache.sysml.runtime.controlprogram.context.GPUObject`

 The `GPUContext` is responsible for GPU memory management and initialization/destruction of Cuda handles.
 Currently, an active instance of the `GPUContext` class is made available globally and is used to store handles
 of the allocated blocks on the GPU. A count is kept per block for the number of instructions that need it.
 When the count is 0, the block may be evicted on a call to `GPUObject.evict()`.

 A `GPUObject` (like RDDObject and BroadcastObject) is stored in CacheableData object. It gets call-backs from SystemDS's bufferpool on following methods
 1. void acquireDeviceRead()
 2. void acquireDeviceModifyDense()
 3. void acquireDeviceModifySparse
 4. void acquireHostRead()
 5. void acquireHostModify()
 6. void releaseInput()
 7. void releaseOutput()

 Sparse matrices on GPU are represented in `CSR` format. In the SystemDS runtime, they are represented in `MCSR` or modified `CSR` format.
 A conversion cost is incurred when sparse matrices are sent back and forth between host and device memory.

 Concrete classes `JCudaContext` and `JCudaObject` (which extend `GPUContext` & `GPUObject` respectively) contain references to `org.jcuda.*`.

 The `LibMatrixCUDA` class contains methods to invoke CUDA libraries (where available) and invoke custom kernels.
 Runtime classes (that extend `GPUInstruction`) redirect calls to functions in this class.
 Some functions in `LibMatrixCUDA` need finer control over GPU memory management primitives. These are provided by `JCudaObject`.

 ### Setup instructions:

 1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 8.0.
 2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v5.1.

 To use SystemDS's GPU backend when using the jar or uber-jar
 1. Add JCuda's jar into the classpath.
 2. Use `-gpu` flag.

 For example: to use GPU backend in standalone mode:
 ```bash
 java -classpath $JAR_PATH:systemml-1.0.0-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ...
 ```
	<!--
	{% comment %}
	Licensed to the Apache Software Foundation (ASF) under one or more
	contributor license agreements. See the NOTICE file distributed with
	this work for additional information regarding copyright ownership.
	The ASF licenses this file to you under the Apache License, Version 2.0
	(the "License"); you may not use this file except in compliance with
	the License. You may obtain a copy of the License at

	http://www.apache.org/licenses/LICENSE-2.0

	Unless required by applicable law or agreed to in writing, software
	distributed under the License is distributed on an "AS IS" BASIS,
	WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
	See the License for the specific language governing permissions and
	limitations under the License.
	{% endcomment %}
	-->

	# Initial prototype for GPU backend

	The GPU backend implements two important abstract classes:
	1. `org.apache.sysml.runtime.controlprogram.context.GPUContext`
	2. `org.apache.sysml.runtime.controlprogram.context.GPUObject`

	The `GPUContext` is responsible for GPU memory management and initialization/destruction of Cuda handles.
	Currently, an active instance of the `GPUContext` class is made available globally and is used to store handles
	of the allocated blocks on the GPU. A count is kept per block for the number of instructions that need it.
	When the count is 0, the block may be evicted on a call to `GPUObject.evict()`.

	A `GPUObject` (like RDDObject and BroadcastObject) is stored in CacheableData object. It gets call-backs from SystemDS's bufferpool on following methods
	1. void acquireDeviceRead()
	2. void acquireDeviceModifyDense()
	3. void acquireDeviceModifySparse
	4. void acquireHostRead()
	5. void acquireHostModify()
	6. void releaseInput()
	7. void releaseOutput()

	Sparse matrices on GPU are represented in `CSR` format. In the SystemDS runtime, they are represented in `MCSR` or modified `CSR` format.
	A conversion cost is incurred when sparse matrices are sent back and forth between host and device memory.

	Concrete classes `JCudaContext` and `JCudaObject` (which extend `GPUContext` & `GPUObject` respectively) contain references to `org.jcuda.*`.

	The `LibMatrixCUDA` class contains methods to invoke CUDA libraries (where available) and invoke custom kernels.
	Runtime classes (that extend `GPUInstruction`) redirect calls to functions in this class.
	Some functions in `LibMatrixCUDA` need finer control over GPU memory management primitives. These are provided by `JCudaObject`.

	### Setup instructions:

	1. Follow the instructions from `https://developer.nvidia.com/cuda-downloads` and install CUDA 8.0.
	2. Follow the instructions from `https://developer.nvidia.com/cudnn` and install CuDNN v5.1.

	To use SystemDS's GPU backend when using the jar or uber-jar
	1. Add JCuda's jar into the classpath.
	2. Use `-gpu` flag.

	For example: to use GPU backend in standalone mode:
	```bash
	java -classpath $JAR_PATH:systemml-1.0.0-SNAPSHOT-standalone.jar org.apache.sysml.api.DMLScript -f MyDML.dml -gpu -exec singlenode ...
	```