Merge pull request #34 from SeanCho1996/master

add simplified chinese documentation
diff --git a/docs-site/docs/half-precision.md b/docs-site/docs/half-precision.md
new file mode 100644
index 0000000..126ae70
--- /dev/null
+++ b/docs-site/docs/half-precision.md
@@ -0,0 +1,104 @@
+---
+id: half-precision
+title: Half Precision
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+Training in half precision brings two main benefits:
+- less GPU memory usage, which supports larger networks (see the quick check below).
+- faster training.
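+
+As a quick check of the memory claim, the following sketch (plain NumPy, independent of SINGA) compares the storage size of the same array in fp32 and fp16:
+```python
+import numpy as np
+
+# the same 1024x1024 array stored in fp32 and in fp16
+x32 = np.zeros((1024, 1024), dtype=np.float32)
+x16 = x32.astype(np.float16)
+
+print(x32.nbytes)  # 4194304 bytes (4 MiB)
+print(x16.nbytes)  # 2097152 bytes (2 MiB), half the memory
+```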
+
+## Half data type
+
+### Half data type definition
+The IEEE 754 standard specifies binary16 as having the following
+[format](https://en.wikipedia.org/wiki/Half-precision_floating-point_format):
+- Sign bit: 1 bit
+- Exponent width: 5 bits
+- Significand precision: 11 bits (10 explicitly stored)
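+
+To make the layout concrete, the sketch below (plain NumPy, independent of SINGA) decodes the three bit fields of a binary16 value:
+```python
+import numpy as np
+
+# reinterpret the 16 bits of a binary16 value as an unsigned integer
+bits = int(np.array(0.5, dtype=np.float16).view(np.uint16))
+
+sign = bits >> 15                # 1 sign bit
+exponent = (bits >> 10) & 0x1F   # 5 exponent bits
+mantissa = bits & 0x3FF          # 10 explicitly stored significand bits
+
+# 0.5 = (-1)^0 * 2^(14-15) * 1.0, so the exponent field is 14 (01110)
+print(f"sign={sign} exponent={exponent:05b} mantissa={mantissa:010b}")
+```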
+
+### Half data type operation
+Data is loaded in fp32 and can be converted to fp16 simply by casting.
+```python
+>>> from singa import tensor, device
+>>> dev = device.create_cuda_gpu()
+>>> x = tensor.random((2,3),dev)
+>>> x
+[[0.7703407  0.42764223 0.5872884 ]
+ [0.78362167 0.70469785 0.64975065]], float32
+>>> y = x.as_type(tensor.float16)
+>>> y
+[[0.7705 0.4277 0.5874]
+ [0.7837 0.7046 0.65  ]], float16
+```
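+
+Note the rounding in the output above: with a 10-bit stored significand, binary16 carries roughly 3 decimal digits of precision.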
+
+Basic arithmetic operations are supported directly in fp16.
+```python
+>>> y+y
+[[1.541  0.8555 1.175 ]
+ [1.567  1.409  1.3   ]], float16
+```
+
+## Training in Half
+
+### Training in half in three steps
+Training in half precision can be done in three steps:
+1. Load data and convert to half
+2. Set data type of optimizer
+3. Train model as usual
+```python
+import numpy as np
+from singa import tensor, opt
+
+# 1. load input data (placeholder loader) and cast it to fp16
+x = load_data()
+x = x.astype(np.float16)
+tx = tensor.from_numpy(x)
+
+# load the model (placeholder builder)
+model = build_model()
+# 2. set the optimizer dtype to fp16
+sgd = opt.SGD(lr=0.1, dtype=tensor.float16)
+
+# 3. train as usual (labels ty are loaded the same way as tx)
+out, loss = model(tx, ty)
+```
+
+### Example
+An example script is `train_cnn.py`; run the command below to train in half precision (the `-pfloat16` flag selects fp16).
+```bash
+python examples/cnn/train_cnn.py cnn mnist -pfloat16
+```
+
+## Implementation
+
+### Half Type Dependency
+This half implementation is integrated into the C++ backend as general half
+type support.
+
+On GPU, the `__half` type is provided by the CUDA math API. To support `__half`
+math operations, the code must be compiled against NVIDIA compute architecture
+6.0 (Pascal) or later.
+
+### Nvidia Hardware Acceleration: Tensor Core
+Tensor Cores, released by Nvidia, further accelerate half precision and multiply
+the throughput of operations like GEMM (cuBLAS) and convolution (cuDNN). To enable
+Tensor Core operation, there are a few restrictions on GEMM dimensions,
+convolution channel sizes, the CUDA version, and the GPU generation (Volta or later).
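+
+As an illustration of the dimension restriction, the hedged sketch below (plain NumPy; the exact eligibility rules depend on the CUDA/cuBLAS version) zero-pads fp16 matrices so every GEMM dimension is a multiple of 8, a common requirement for Tensor Core kernels:
+```python
+import numpy as np
+
+def pad_to_multiple(a, multiple=8):
+    """Zero-pad a 2-D fp16 array so both dimensions are multiples of `multiple`."""
+    rows = (-a.shape[0]) % multiple
+    cols = (-a.shape[1]) % multiple
+    return np.pad(a, ((0, rows), (0, cols)))
+
+a = np.ones((30, 50), dtype=np.float16)
+print(pad_to_multiple(a).shape)  # (32, 56), now Tensor Core friendly
+```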
+
+### Implement Operations
+Half operations are primarily implemented in `tensor_math_cuda.h`, by specializing
+the operation templates with the half type and implementing the low-level computation.
+
+For example, GEMM operation is implemented as:
+```c++
+template <>
+void GEMM<half_float::half, lang::Cuda>(const half_float::half alpha,
+                                        const Tensor& A, const Tensor& B,
+                                        const half_float::half beta, Tensor* C,
+                                        Context* ctx) {
+  // ...
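+  // note: cuBLAS assumes column-major storage, so B and A are swapped
+  // here to compute the row-major product C = alpha * A * B + beta * C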
+  CUBLAS_CHECK(cublasGemmEx(handle, transb, transa, ncolB, nrowA, ncolA,
+                            alphaPtr, BPtr, Btype, ldb, APtr, Atype, lda,
+                            betaPtr, CPtr, Ctype, ldc, computeType, algo));
+  // ...
+}
+```
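+
+`cublasGemmEx` takes the data types of A, B, and C together with a separate compute type, so fp16 storage can be paired with either fp16 or fp32 accumulation depending on the chosen `computeType` and `algo`.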
\ No newline at end of file