Update the website docs for the v3.1.0 announcement
diff --git a/docs-site/docs/download.md b/docs-site/docs/download.md
index b2ba081..667a09a 100644
--- a/docs-site/docs/download.md
+++ b/docs-site/docs/download.md
@@ -18,12 +18,42 @@
 
 You can also check the SHA512 or MD5 values to see if the download is completed.
 
+## V3.1.0 (30 October 2020):
+
+- [Apache SINGA 3.1.0](http://www.apache.org/dyn/closer.cgi/singa/3.1.0/apache-singa-3.1.0.tar.gz)
+  [\[SHA512\]](https://www.apache.org/dist/singa/3.1.0/apache-singa-3.1.0.tar.gz.sha512)
+  [\[ASC\]](https://www.apache.org/dist/singa/3.1.0/apache-singa-3.1.0.tar.gz.asc)
+- [Release Notes 3.1.0](http://singa.apache.org/docs/releases/RELEASE_NOTES_3.1.0)
+- Major changes:
+  - Update the Tensor core:
+    - Support tensor transformation (reshape, transpose) for tensors up to 6
+      dimensions.
+    - Implement traverse_unary_transform in the CUDA backend, similar to the
+      CPP backend implementation.
+  - Add new tensor operators into the autograd module.
+  - Reconstruct sonnx to
+    - Support creating operators from both layer and autograd.
+    - Re-write SingaRep to provide a more powerful intermediate representation
+      of SINGA.
+    - Add a SONNXModel which inherits from Model to provide a uniform API and
+      features.
+  - Replace Travis CI with a GitHub workflow. Add quality and coverage
+    management.
+  - Add compiling and packaging scripts to create wheel packages for
+    distribution.
+  - Fix bugs:
+    - Fix the IMDB LSTM model example training script.
+    - Fix the Tensor operation Mult on broadcasting use cases.
+    - The Gaussian function on Tensor can now run on tensors of odd size.
+    - Update the testing helper function gradients() in autograd to look up a
+      parameter's gradient by its Python object id for testing purposes.
+
 ## V3.0.0 (18 April 2020):
 
-- [Apache SINGA 3.0.0](http://www.apache.org/dyn/closer.cgi/singa/3.0.0/apache-singa-3.0.0.tar.gz)
-  [\[SHA512\]](https://www.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.sha512)
-  [\[ASC\]](https://www.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.asc)
-- [Release Notes 3.0.0](releases/RELEASE_NOTES_3.0.0)
+- [Apache SINGA 3.0.0](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.asc)
+- [Release Notes 3.0.0](http://singa.apache.org/docs/releases/RELEASE_NOTES_3.0.0)
 - New features and major changes,
   - Enhanced ONNX. Multiple ONNX models have been tested in SINGA.
   - Distributed training with MPI and NCCL Communication optimization through
@@ -39,10 +69,10 @@
 
 ## Incubating v2.0.0 (20 April 2019):
 
-- [Apache SINGA 2.0.0 (incubating)](http://www.apache.org/dyn/closer.cgi/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz)
-  [\[SHA512\]](https://www.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.sha512)
-  [\[ASC\]](https://www.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.asc)
-- [Release Notes 2.0.0 (incubating)](releases/RELEASE_NOTES_2.0.0.html)
+- [Apache SINGA 2.0.0 (incubating)](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.asc)
+- [Release Notes 2.0.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_2.0.0.html)
 - New features and major updates,
   - Enhance autograd (for Convolution networks and recurrent networks)
   - Support ONNX
@@ -56,7 +86,7 @@
 - [Apache SINGA 1.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz)
   [\[SHA512\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.sha512)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.asc)
-- [Release Notes 1.2.0 (incubating)](releases/RELEASE_NOTES_1.2.0.html)
+- [Release Notes 1.2.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.2.0.html)
 - New features and major updates,
   - Implement autograd (currently support MLP model)
   - Upgrade PySinga to support Python 3
@@ -74,7 +104,7 @@
 - [Apache SINGA 1.1.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz)
   [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.md5)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.asc)
-- [Release Notes 1.1.0 (incubating)](releases/RELEASE_NOTES_1.1.0.html)
+- [Release Notes 1.1.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.1.0.html)
 - New features and major updates,
   - Create Docker images (CPU and GPU versions)
   - Create Amazon AMI for SINGA (CPU version)
@@ -98,7 +128,7 @@
 - [Apache SINGA 1.0.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz)
   [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.md5)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.asc)
-- [Release Notes 1.0.0 (incubating)](releases/RELEASE_NOTES_1.0.0.html)
+- [Release Notes 1.0.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.0.0.html)
 - New features and major updates,
   - Tensor abstraction for supporting more machine learning models.
   - Device abstraction for running on different hardware devices, including CPU,
@@ -118,7 +148,7 @@
 - [Apache SINGA 0.3.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz)
   [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.md5)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.asc)
-- [Release Notes 0.3.0 (incubating)](releases/RELEASE_NOTES_0.3.0.html)
+- [Release Notes 0.3.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.3.0.html)
 - New features and major updates,
   - Training on GPU cluster enables training of deep learning models over a GPU
     cluster.
@@ -136,7 +166,7 @@
 - [Apache SINGA 0.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz)
   [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.md5)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.asc)
-- [Release Notes 0.2.0 (incubating)](releases/RELEASE_NOTES_0.2.0.html)
+- [Release Notes 0.2.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.2.0.html)
 - New features and major updates,
   - Training on GPU enables training of complex models on a single node with
     multiple GPU cards.
@@ -164,7 +194,7 @@
   [\[MD5\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.md5)
   [\[ASC\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.asc)
 - [Amazon EC2 image](https://console.aws.amazon.com/ec2/v2/home?region=ap-southeast-1#LaunchInstanceWizard:ami=ami-b41001e6)
-- [Release Notes 0.1.0 (incubating)](releases/RELEASE_NOTES_0.1.0.html)
+- [Release Notes 0.1.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.1.0.html)
 - Major features include,
   - Installation using GNU build utility
   - Scripts for job management with zookeeper
diff --git a/docs-site/docs/releases/RELEASE_NOTES_0.1.0.md b/docs-site/docs/releases/RELEASE_NOTES_0.1.0.md
index 9d1dc09..a025794 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_0.1.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_0.1.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_0.1.0
-title: singa-incubating-0.1.0 Release Notes
+title: Apache SINGA-incubating-0.1.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_0.2.0.md b/docs-site/docs/releases/RELEASE_NOTES_0.2.0.md
index 77633f3..72c2baf 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_0.2.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_0.2.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_0.2.0
-title: singa-incubating-0.2.0 Release Notes
+title: Apache SINGA-incubating-0.2.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_0.3.0.md b/docs-site/docs/releases/RELEASE_NOTES_0.3.0.md
index ed02198..868c173 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_0.3.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_0.3.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_0.3.0
-title: singa-incubating-0.3.0 Release Notes
+title: Apache SINGA-incubating-0.3.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_1.0.0.md b/docs-site/docs/releases/RELEASE_NOTES_1.0.0.md
index 0577ede..36ecc98 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_1.0.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_1.0.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_1.0.0
-title: singa-incubating-1.0.0 Release Notes
+title: Apache SINGA-incubating-1.0.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_1.1.0.md b/docs-site/docs/releases/RELEASE_NOTES_1.1.0.md
index db79546..41ccd28 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_1.1.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_1.1.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_1.1.0
-title: singa-incubating-1.1.0 Release Notes
+title: Apache SINGA-incubating-1.1.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_1.2.0.md b/docs-site/docs/releases/RELEASE_NOTES_1.2.0.md
index aade8bd..2da1cbd 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_1.2.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_1.2.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_1.2.0
-title: singa-incubating-1.2.0 Release Notes
+title: Apache SINGA-incubating-1.2.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_2.0.0.md b/docs-site/docs/releases/RELEASE_NOTES_2.0.0.md
index f574d69..9288784 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_2.0.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_2.0.0.md
@@ -1,6 +1,6 @@
 ---
 id: RELEASE_NOTES_2.0.0
-title: singa-incubating-2.0.0 Release Notes
+title: Apache SINGA-incubating-2.0.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
diff --git a/docs-site/docs/releases/RELEASE_NOTES_3.0.0.md b/docs-site/docs/releases/RELEASE_NOTES_3.0.0.md
index 70a0ea2..2922dec 100644
--- a/docs-site/docs/releases/RELEASE_NOTES_3.0.0.md
+++ b/docs-site/docs/releases/RELEASE_NOTES_3.0.0.md
@@ -1,12 +1,10 @@
 ---
 id: RELEASE_NOTES_3.0.0
-title: singa-3.0.0 Release Notes
+title: Apache SINGA-3.0.0 Release Notes
 ---
 
 <!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
 
-Release Notes - SINGA - Version singa-3.0.0.rc1
-
 SINGA is a distributed deep learning library.
 
 This release includes following changes:
diff --git a/docs-site/docs/releases/RELEASE_NOTES_3.1.0.md b/docs-site/docs/releases/RELEASE_NOTES_3.1.0.md
new file mode 100644
index 0000000..b74f7bf
--- /dev/null
+++ b/docs-site/docs/releases/RELEASE_NOTES_3.1.0.md
@@ -0,0 +1,50 @@
+---
+id: RELEASE_NOTES_3.1.0
+title: Apache SINGA-3.1.0 Release Notes
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA is a distributed deep learning library.
+
+This release includes following changes:
+
+- Tensor core:
+
+  - Support tensor transformation (reshape, transpose) for tensors up to 6
+    dimensions.
+  - Implement traverse_unary_transform in the CUDA backend, similar to the CPP
+    backend implementation.
+
+- Add new tensor operators into the autograd module, including CosSim,
+  DepthToSpace, Embedding, Erf, Expand, Floor, Pad, Round, Rounde, SpaceToDepth,
+  UpSample, Where. The corresponding ONNX operators are thus supported by SINGA.
+
+- Add Embedding and Gemm into the layer module.
+
+- Add new optimizers to the opt module, including RMSProp, Adam, and AdaGrad.
+
+- Extend the sonnx module to support DenseNet121, ShuffleNetv1, ShuffleNetv2,
+  SqueezeNet, VGG19, GPT2, and RoBERTa.
+
+- Reconstruct sonnx to
+
+  - Support creating operators from both layer and autograd.
+  - Re-write SingaRep to provide a more powerful intermediate representation of
+    SINGA.
+  - Add a SONNXModel which inherits from Model to provide a uniform API and
+    features.
+
+- Add one example that trains a BiLSTM model over the InsuranceQA data.
+
+- Replace Travis CI with a GitHub workflow. Add quality and coverage
+  management.
+
+- Add compiling and packaging scripts to create wheel packages for
+  distribution.
+
+- Fix bugs:
+  - Fix the IMDB LSTM model example training script.
+  - Fix the Tensor operation Mult on broadcasting use cases.
+  - The Gaussian function on Tensor can now run on tensors of odd size.
+  - Update the testing helper function gradients() in autograd to look up a
+    parameter's gradient by its Python object id for testing purposes.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/autograd.md b/docs-site/website/versioned_docs/version-3.1.0/autograd.md
new file mode 100644
index 0000000..e190cff
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/autograd.md
@@ -0,0 +1,288 @@
+---
+id: version-3.1.0-autograd
+title: Autograd
+original_id: autograd
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+There are two typical ways to implement autograd: symbolic differentiation as
+in [Theano](http://deeplearning.net/software/theano/index.html), or reverse
+differentiation as in
+[PyTorch](https://pytorch.org/docs/stable/notes/autograd.html). SINGA follows
+the PyTorch way: it records the computation graph and applies backward
+propagation automatically after forward propagation. The autograd algorithm is
+explained in detail
+[here](https://pytorch.org/docs/stable/notes/autograd.html). This page explains
+the relevant modules in SINGA and gives an example to illustrate the usage.
+
+## Relevant Modules
+
+There are three classes involved in autograd, namely `singa.tensor.Tensor`,
+`singa.autograd.Operation`, and `singa.autograd.Layer`. In the rest of this
+article, we use tensor, operation and layer to refer to an instance of the
+respective class.
+
+### Tensor
+
+Three attributes of Tensor are used by autograd,
+
+- `.creator` is an `Operation` instance. It records the operation that generates
+  the Tensor instance.
+- `.requires_grad` is a boolean variable. It is used to indicate that the
+  autograd algorithm needs to compute the gradient of the tensor (i.e., the
+  owner). For example, during backpropagation, the gradients of the tensors for
+  the weight matrix of a linear layer and the feature maps of a convolution
+  layer (not the bottom layer) should be computed.
+- `.stores_grad` is a boolean variable. It is used to indicate that the gradient
+  of the owner tensor should be stored and output by the backward function. For
+  example, the gradient of the feature maps is computed during backpropagation,
+  but is not included in the output of the backward function.
+
+Programmers can change `requires_grad` and `stores_grad` of a Tensor instance.
+For example, if the latter is set to True, the corresponding gradient is
+included in the output of the backward function. Note that if `stores_grad` is
+True, then `requires_grad` must also be True, but not vice versa.
+
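+For example, the parameter tensor and the input tensor below differ only in
+these two flags (the same constructor is used in the examples later on this
+page):
+
+```python
+from singa.tensor import Tensor
+
+# parameter tensor: its gradient is computed and output by backward()
+w = Tensor(shape=(3, 2), requires_grad=True, stores_grad=True)
+w.gaussian(0.0, 0.1)
+
+# input tensor: its gradient is neither computed nor stored
+x = Tensor(shape=(4, 3), requires_grad=False, stores_grad=False)
+x.set_value(1.0)
+```
+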
+### Operation
+
+It takes one or more `Tensor` instances as input, and then outputs one or more
+`Tensor` instances. For example, ReLU can be implemented as a specific Operation
+subclass. When an `Operation` instance is called (after instantiation), the
+following two steps are executed:
+
+1. Record the source operations, i.e., the `creator`s of the input tensors.
+2. Do the calculation by calling the member function `.forward()`.
+
+There are two member functions for the forward and backward passes, i.e.,
+`.forward()` and `.backward()`. They take `Tensor.data` as inputs (the type is
+`CTensor`) and output `CTensor`s. To add a specific operation, a subclass of
+`Operation` should implement its own `.forward()` and `.backward()`. The
+`backward()` function is called automatically by autograd's `backward()`
+function during backward propagation to compute the gradients of the inputs
+(according to the `requires_grad` field).
+
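+As an illustration, below is a minimal sketch of a custom operation. The
+`Scale` operator is hypothetical, and `singa.MultFloat` is assumed to be the
+swig-wrapped CTensor helper that the tensor module uses for tensor-scalar
+multiplication:
+
+```python
+from singa import autograd
+from singa import singa_wrap as singa
+
+
+class Scale(autograd.Operation):
+    """Multiply the input tensor by a constant factor."""
+
+    def __init__(self, factor):
+        super(Scale, self).__init__()
+        self.factor = factor
+
+    def forward(self, x):  # x is a CTensor
+        return singa.MultFloat(x, self.factor)
+
+    def backward(self, dy):  # dy is the gradient w.r.t. the output
+        return singa.MultFloat(dy, self.factor)
+
+
+# calling an Operation instance returns the list of output tensors,
+# e.g., y = Scale(2.0)(x)[0]
+```
+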
+### Layer
+
+For those operations that require parameters, we package them into a new class,
+`Layer`. For example, the convolution operation is wrapped into a convolution
+layer. A `Layer` manages (stores) the parameters and calls the corresponding
+`Operation`s to implement the transformation, as sketched below.
+
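+For instance, the `Linear` layer used later on this page creates and stores its
+weight and bias tensors when it is instantiated, and conceptually calls the
+matmul and add_bias operations when it is invoked (a usage sketch, assuming `x`
+is a tensor whose last dimension is 3):
+
+```python
+linear = autograd.Linear(3, 2)  # in_features=3, out_features=2
+y = linear(x)                   # computes x * W + b
+```
+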
+## Examples
+
+Multiple examples are provided in the
+[example folder](https://github.com/apache/singa/tree/master/examples/autograd).
+We explain two representative examples here.
+
+### Operation only
+
+The following code implements an MLP model using only Operation instances (no
+Layer instances).
+
+#### Import packages
+
+```python
+from singa.tensor import Tensor
+from singa import autograd
+from singa import opt
+```
+
+#### Create weight matrix and bias vector
+
+The parameter tensors are created with both `requires_grad` and `stores_grad`
+set to `True`.
+
+```python
+w0 = Tensor(shape=(2, 3), requires_grad=True, stores_grad=True)
+w0.gaussian(0.0, 0.1)
+b0 = Tensor(shape=(1, 3), requires_grad=True, stores_grad=True)
+b0.set_value(0.0)
+
+w1 = Tensor(shape=(3, 2), requires_grad=True, stores_grad=True)
+w1.gaussian(0.0, 0.1)
+b1 = Tensor(shape=(1, 2), requires_grad=True, stores_grad=True)
+b1.set_value(0.0)
+```
+
+#### Training
+
+```python
+inputs = Tensor(data=data)  # data matrix
+target = Tensor(data=label) # label vector
+autograd.training = True    # for training
+sgd = opt.SGD(0.05)   # optimizer
+
+for i in range(10):
+    x = autograd.matmul(inputs, w0) # matrix multiplication
+    x = autograd.add_bias(x, b0)    # add the bias vector
+    x = autograd.relu(x)            # ReLU activation operation
+
+    x = autograd.matmul(x, w1)
+    x = autograd.add_bias(x, b1)
+
+    loss = autograd.softmax_cross_entropy(x, target)
+
+    for p, g in autograd.backward(loss):
+        sgd.update(p, g)
+```
+
+### Operation + Layer
+
+The following
+[example](https://github.com/apache/singa/blob/master/examples/autograd/mnist_cnn.py)
+implements a CNN model using layers provided by the autograd module.
+
+#### Create the layers
+
+```python
+conv1 = autograd.Conv2d(1, 32, 3, padding=1, bias=False)
+bn1 = autograd.BatchNorm2d(32)
+pooling1 = autograd.MaxPool2d(3, 1, padding=1)
+conv21 = autograd.Conv2d(32, 16, 3, padding=1)
+conv22 = autograd.Conv2d(32, 16, 3, padding=1)
+bn2 = autograd.BatchNorm2d(32)
+linear = autograd.Linear(32 * 28 * 28, 10)
+pooling2 = autograd.AvgPool2d(3, 1, padding=1)
+```
+
+#### Define the forward function
+
+The operations in the forward pass will be recorded automatically for backward
+propagation.
+
+```python
+def forward(x, t):
+    # x is the input data (a batch of images)
+    # t is the label vector (a batch of integers)
+    y = conv1(x)           # Conv layer
+    y = autograd.relu(y)   # ReLU operation
+    y = bn1(y)             # BN layer
+    y = pooling1(y)        # Pooling Layer
+
+    # two parallel convolution layers
+    y1 = conv21(y)
+    y2 = conv22(y)
+    y = autograd.cat((y1, y2), 1)  # cat operation
+    y = autograd.relu(y)           # ReLU operation
+    y = bn2(y)
+    y = pooling2(y)
+
+    y = autograd.flatten(y)        # flatten operation
+    y = linear(y)                  # Linear layer
+    loss = autograd.softmax_cross_entropy(y, t)  # operation
+    return loss, y
+```
+
+#### Training
+
+```python
+autograd.training = True
+for epoch in range(epochs):
+    for i in range(batch_number):
+        inputs = tensor.Tensor(device=dev,
+                               data=x_train[i * batch_sz:(1 + i) * batch_sz],
+                               stores_grad=False)
+        targets = tensor.Tensor(device=dev,
+                                data=y_train[i * batch_sz:(1 + i) * batch_sz],
+                                requires_grad=False,
+                                stores_grad=False)
+
+        loss, y = forward(inputs, targets) # forward the net
+
+        for p, gp in autograd.backward(loss):  # auto backward
+            sgd.update(p, gp)
+```
+
+### Using the Model API
+
+The following
+[example](https://github.com/apache/singa/blob/master/examples/cnn/model/cnn.py)
+implements a CNN model using the [Model API](./graph).
+
+#### Define the subclass of Model
+
+Define the model class as a subclass of Model. In this way, all operations used
+during the training phase form a computational graph and are analyzed; the
+operations in the graph are then scheduled and executed efficiently. Layers can
+also be included in the model class. The snippet below uses a simple MLP for
+illustration.
+
+```python
+class MLP(model.Model):  # the model is a subclass of Model
+
+    def __init__(self, data_size=10, perceptron_size=100, num_classes=10):
+        super(MLP, self).__init__()
+
+        # init the operators, layers and other objects
+        self.relu = layer.ReLU()
+        self.linear1 = layer.Linear(perceptron_size)
+        self.linear2 = layer.Linear(num_classes)
+        self.softmax_cross_entropy = layer.SoftMaxCrossEntropy()
+
+    def forward(self, inputs):  # define the forward function
+        y = self.linear1(inputs)
+        y = self.relu(y)
+        y = self.linear2(y)
+        return y
+
+    def train_one_batch(self, x, y):
+        out = self.forward(x)
+        loss = self.softmax_cross_entropy(out, y)
+        self.optimizer(loss)
+        return out, loss
+
+    def set_optimizer(self, optimizer):  # attach an optimizer
+        self.optimizer = optimizer
+```
+
+#### Training
+
+```python
+# create a model instance
+model = MLP()
+# initialize optimizer and attach it to the model
+sgd = opt.SGD(lr=0.005, momentum=0.9, weight_decay=1e-5)
+model.set_optimizer(sgd)
+# input and target placeholders for the model
+tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
+ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32)
+# compile the model before training
+model.compile([tx], is_train=True, use_graph=True, sequential=False)
+
+# train the model iteratively
+for b in range(num_train_batch):
+    # generate the next mini-batch
+    x, y = ...
+
+    # Copy the data into input tensors
+    tx.copy_from_numpy(x)
+    ty.copy_from_numpy(y)
+
+    # Training with one batch
+    out, loss = model(tx, ty)
+```
+
+#### Save a model checkpoint
+
+```python
+# define the path to save the checkpoint
+checkpointpath="checkpoint.zip"
+
+# save a checkpoint
+model.save_states(fpath=checkpointpath)
+```
+
+#### Load a model checkpoint
+
+```python
+import os
+
+# define the path to load the checkpoint
+checkpointpath = "checkpoint.zip"
+
+# load a checkpoint if it exists
+if os.path.exists(checkpointpath):
+    model.load_states(fpath=checkpointpath)
+```
+
+### Python API
+
+Refer to the
+[Python API documentation](https://singa.readthedocs.io/en/latest/autograd.html#module-singa.autograd)
+for more details.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/benchmark-train.md b/docs-site/website/versioned_docs/version-3.1.0/benchmark-train.md
new file mode 100644
index 0000000..29f9445
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/benchmark-train.md
@@ -0,0 +1,27 @@
+---
+id: version-3.1.0-benchmark-train
+title: Benchmark for Distributed Training
+original_id: benchmark-train
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+Workload: we use a deep convolutional neural network,
+[ResNet-50](https://github.com/apache/singa/blob/master/examples/cnn/model/resnet.py),
+as the application. ResNet-50 has 50 convolution layers for image
+classification, and it requires 3.8 GFLOPs to pass a single image (of size
+224x224) through the network.
+
+Hardware: we use p2.8xlarge instances from AWS, each of which has 8 Nvidia
+Tesla K80 GPUs (96 GB GPU memory in total), 32 vCPUs, 488 GB main memory, and
+10 Gbps network bandwidth.
+
+Metric: we measure the time per iteration for different numbers of workers to
+evaluate the scalability of SINGA. The batch size is fixed at 32 per GPU. A
+synchronous training scheme is applied. As a result, the effective batch size
+is $32N$, where $N$ is the number of GPUs. We compare with a popular open
+source system that uses the parameter server topology, in which the first GPU
+is selected as the server.
+
+![Benchmark Experiments](assets/benchmark.png) <br/> **Scalability test. Bars
+are for the throughput; lines are for the communication cost.**
diff --git a/docs-site/website/versioned_docs/version-3.1.0/build.md b/docs-site/website/versioned_docs/version-3.1.0/build.md
new file mode 100644
index 0000000..af4f3cc
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/build.md
@@ -0,0 +1,529 @@
+---
+id: version-3.1.0-build
+title: Build SINGA from Source
+original_id: build
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+The source files can be downloaded either as a
+[tar.gz file](https://dist.apache.org/repos/dist/dev/singa/) or as a git repo
+
+```shell
+$ git clone https://github.com/apache/singa.git
+$ cd singa/
+```
+
+If you want to contribute code to SINGA, refer to the
+[contribute-code page](contribute-code.md) for the steps and requirements.
+
+## Use Conda to build SINGA
+
+conda-build is a build tool that installs the dependent libraries from Anaconda
+Cloud and executes the build scripts.
+
+To install conda-build (after installing conda)
+
+```shell
+conda install conda-build
+```
+
+### Build CPU Version
+
+To build the CPU version of SINGA
+
+```shell
+conda build tool/conda/singa/
+```
+
+The above commands have been tested on Ubuntu (14.04, 16.04 and 18.04) and macOS
+10.11. Refer to the [Travis-CI page](https://travis-ci.org/apache/singa) for
+more information.
+
+### Build GPU Version
+
+To build the GPU version of SINGA, the build machine must have an Nvidia GPU,
+and the CUDA driver (>= 384.81), CUDA toolkit (>= 9) and cuDNN (>= 7) must be
+installed. The following two Docker images provide the building environment:
+
+1. apache/singa:conda-cuda9.0
+2. apache/singa:conda-cuda10.0
+
+Once the building environment is ready, you need to export the CUDA version
+first, and then run the conda command to build SINGA
+
+```shell
+export CUDA=x.y  # e.g. 9.0
+conda build tool/conda/singa/
+```
+
+### Post Processing
+
+The location of the generated package file (`.tar.gz`) is shown on the screen.
+The generated package can be installed directly,
+
+```shell
+conda install -c conda-forge --use-local <path to the package file>
+```
+
+or uploaded to anaconda cloud for others to download and install. You need to
+register an account on anaconda for
+[uploading the package](https://docs.anaconda.com/anaconda-cloud/user-guide/getting-started/).
+
+```shell
+conda install anaconda-client
+anaconda login
+anaconda upload -l main <path to the package file>
+```
+
+After uploading the package to the cloud, you can see it on
+[Anaconda Cloud](https://anaconda.org/) website or via the following command
+
+```shell
+conda search -c <anaconda username> singa
+```
+
+Each specific SINGA package is identified by the version and build string. To
+install a specific SINGA package, you need to provide all the information, e.g.,
+
+```shell
+conda install -c <anaconda username> -c conda-forge singa=2.1.0.dev=cpu_py36
+```
+
+To make the installation command simple, you can create the following additional
+packages which depend on the latest CPU and GPU SINGA packages.
+
+```console
+# for singa-cpu
+conda build tool/conda/cpu/  --python=3.6
+conda build tool/conda/cpu/  --python=3.7
+# for singa-gpu
+conda build tool/conda/gpu/  --python=3.6
+conda build tool/conda/gpu/  --python=3.7
+```
+
+Therefore, when you run
+
+```shell
+conda install -c <anaconda username> -c conda-forge singa-xpu
+```
+
+(`xpu` is either 'cpu' or 'gpu'), the corresponding real SINGA package is
+installed as the dependent library.
+
+## Use native tools to build SINGA on Ubuntu
+
+Refer to SINGA
+[Dockerfiles](https://github.com/apache/singa/blob/master/tool/docker/devel/ubuntu/cuda9/Dockerfile#L30)
+for the instructions on installing the dependent libraries on Ubuntu 16.04. You
+can also create a Docker container using the devel Docker images and build SINGA
+inside the container. To build SINGA with GPU, DNNL, Python and unit tests, run
+the following instructions
+
+```shell
+mkdir build    # at the root of singa folder
+cd build
+cmake -DENABLE_TEST=ON -DUSE_CUDA=ON -DUSE_DNNL=ON -DUSE_PYTHON3=ON ..
+make
+cd python
+pip install .
+```
+
+The details of the CMake options are explained in the last section of this
+page. The last command installs the Python package. You can also run
+`pip install -e .`, which creates symlinks instead of copying the Python files
+into the site-packages folder.
+
+If SINGA is compiled with ENABLE_TEST=ON, you can run the unit tests by
+
+```shell
+$ ./bin/test_singa
+```
+
+All the test cases are listed together with the testing results. If SINGA
+passes all tests, then you have successfully installed SINGA.
+
+## Use native tools to build SINGA on CentOS 7
+
+Building from source differs for CentOS 7 because the package names differ.
+Follow the instructions given below.
+
+### Installing dependencies
+
+Basic packages/libraries
+
+```shell
+sudo yum install freetype-devel libXft-devel ncurses-devel openblas-devel blas-devel lapack-devel atlas-devel kernel-headers unzip wget pkgconfig zip zlib-devel libcurl-devel cmake curl unzip dh-autoreconf git python-devel glog-devel protobuf-devel
+```
+
+For build-essential
+
+```shell
+sudo yum group install "Development Tools"
+```
+
+For installing swig
+
+```shell
+sudo yum install pcre-devel
+wget http://prdownloads.sourceforge.net/swig/swig-3.0.10.tar.gz
+tar xvzf swig-3.0.10.tar.gz
+cd swig-3.0.10
+./configure --prefix=${RUN}
+make
+make install
+```
+
+For installing gfortran
+
+```shell
+sudo yum install centos-release-scl-rh
+sudo yum --enablerepo=centos-sclo-rh-testing install devtoolset-7-gcc-gfortran
+```
+
+For installing pip and other packages
+
+```shell
+sudo yum install epel-release
+sudo yum install python-pip
+pip install matplotlib numpy pandas scikit-learn pydot
+```
+
+### Installation
+
+Follow steps 1-5 of _Use native tools to build SINGA on Ubuntu_
+
+### Testing
+
+You can run the unit tests by,
+
+```shell
+$ ./bin/test_singa
+```
+
+All the test cases are listed together with the testing results. If SINGA
+passes all tests, then you have successfully installed SINGA.
+
+## Compile SINGA on Windows
+
+Instructions for building on Windows with Python support can be found on the
+[install-win page](install-win.md).
+
+## More details about the compilation options
+
+### USE_MODULES (deprecated)
+
+If protobuf and openblas are not installed, you can compile SINGA together with
+them
+
+```shell
+# in the SINGA root folder
+$ mkdir build
+$ cd build
+$ cmake -DUSE_MODULES=ON ..
+$ make
+```
+
+CMake will download OpenBLAS and Protobuf (2.6.1) and compile them together
+with SINGA.
+
+You can use `ccmake ..` to configure the compilation options. If some dependent
+libraries are not in the system default paths, you need to export the following
+environment variables
+
+```shell
+export CMAKE_INCLUDE_PATH=<path to the header file folder>
+export CMAKE_LIBRARY_PATH=<path to the lib file folder>
+```
+
+### USE_PYTHON
+
+Option for compiling the Python wrapper for SINGA,
+
+```shell
+$ cmake -DUSE_PYTHON=ON ..
+$ make
+$ cd python
+$ pip install .
+```
+
+### USE_CUDA
+
+Users are encouraged to install CUDA and
+[cuDNN](https://developer.nvidia.com/cudnn) to run SINGA on GPUs for better
+performance.
+
+SINGA has been tested over CUDA 9/10 and cuDNN 7. If cuDNN is installed into a
+non-system folder, e.g. /home/bob/local/cudnn/, the following commands should
+be executed so that cmake and the runtime can find it
+
+```shell
+$ export CMAKE_INCLUDE_PATH=/home/bob/local/cudnn/include:$CMAKE_INCLUDE_PATH
+$ export CMAKE_LIBRARY_PATH=/home/bob/local/cudnn/lib64:$CMAKE_LIBRARY_PATH
+$ export LD_LIBRARY_PATH=/home/bob/local/cudnn/lib64:$LD_LIBRARY_PATH
+```
+
+The cmake options for CUDA and cuDNN should be switched on
+
+```shell
+# Dependent libs are installed already
+$ cmake -DUSE_CUDA=ON ..
+$ make
+```
+
+### USE_DNNL
+
+Users can enable DNNL to enhance the performance of CPU computation.
+
+The installation guide of DNNL can be found
+[here](https://github.com/intel/mkl-dnn#installation).
+
+SINGA has been tested over DNNL v1.1.
+
+To build SINGA with DNNL support:
+
+```shell
+# Dependent libs are installed already
+$ cmake -DUSE_DNNL=ON ..
+$ make
+```
+
+### USE_OPENCL
+
+SINGA uses opencl-headers and viennacl (version 1.7.1 or newer) for OpenCL
+support, which can be installed via
+
+```shell
+# On Ubuntu 16.04
+$ sudo apt-get install opencl-headers libviennacl-dev
+# On Fedora
+$ sudo yum install opencl-headers viennacl
+```
+
+Additionally, you will need the OpenCL Installable Client Driver (ICD) for the
+platforms that you want to run OpenCL on.
+
+- For AMD and Nvidia GPUs, the driver package should also install the correct
+  OpenCL ICD.
+- For Intel CPUs and/or GPUs, get the driver from the
+  [Intel website](https://software.intel.com/en-us/articles/opencl-drivers).
+  Note that the drivers provided on that website only support recent CPUs and
+  Iris GPUs.
+- For older Intel CPUs, you can use the `beignet-opencl-icd` package.
+
+Note that running OpenCL on CPUs is not currently recommended because it is
+slow. Memory transfer is on the order of whole seconds (thousands of
+milliseconds on CPUs compared to a few milliseconds on GPUs).
+
+More information on setting up a working OpenCL environment may be found
+[here](https://wiki.tiker.net/OpenCLHowTo).
+
+If the package version of ViennaCL is not at least 1.7.1, you will need to build
+it from source:
+
+Clone [the repository from here](https://github.com/viennacl/viennacl-dev),
+checkout the `release-1.7.1` tag and build it. Remember to add its directory to
+`PATH` and the built libraries to `LD_LIBRARY_PATH`.
+
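+A sketch of these steps (the commands are illustrative and assume a standard
+cmake-based build):
+
+```shell
+$ git clone https://github.com/viennacl/viennacl-dev.git
+$ cd viennacl-dev
+$ git checkout release-1.7.1
+$ mkdir build && cd build
+$ cmake .. && make
+$ export PATH=$PWD:$PATH
+$ export LD_LIBRARY_PATH=$PWD:$LD_LIBRARY_PATH
+```
+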
+To build SINGA with OpenCL support (tested on SINGA 1.1):
+
+```shell
+$ cmake -DUSE_OPENCL=ON ..
+$ make
+```
+
+### PACKAGE
+
+This setting is used to build the Debian package. Set PACKAGE=ON and build the
+package with the make command like this:
+
+```shell
+$ cmake -DPACKAGE=ON ..
+$ make package
+```
+
+## FAQ
+
+- Q: Error from 'import singa'
+
+  A: Please check the detailed error from
+  `python -c "from singa import _singa_wrap"`. Sometimes it is caused by the
+  dependent libraries, e.g. multiple versions of protobuf, missing cuDNN, or a
+  numpy version mismatch. The following steps show the solutions for different
+  cases
+
+  1. Check cuDNN and CUDA. If cuDNN is missing or does not match the wheel
+     version, you can download the correct version of cuDNN into ~/local/cudnn/
+     and
+
+     ```shell
+     $ echo "export LD_LIBRARY_PATH=/home/<yourname>/local/cudnn/lib64:$LD_LIBRARY_PATH" >> ~/.bashrc
+     ```
+
+  2. If the problem is related to protobuf, you can install protobuf (3.6.1)
+     from source into a local folder, say ~/local/; decompress the tar file,
+     and then
+
+     ```shell
+     $ ./configure --prefix=/home/<yourname>/local
+     $ make && make install
+     $ echo "export LD_LIBRARY_PATH=/home/<yourname>/local/lib:$LD_LIBRARY_PATH" >> ~/.bashrc
+     $ source ~/.bashrc
+     ```
+
+  3. If it cannot find other libs, including Python, then create a virtual env
+     using `pip` or `conda`;
+
+  4. If it is not caused by the above reasons, go to the folder of
+     `_singa_wrap.so`,
+
+     ```shell
+     $ python
+     >>> import importlib
+     >>> importlib.import_module('_singa_wrap')
+     ```
+
+     Check the error message. For example, if the numpy version mismatches, the
+     error message would be,
+
+     ```shell
+     RuntimeError: module compiled against API version 0xb but this version of numpy is 0xa
+     ```
+
+     Then you need to upgrade numpy.
+
+- Q: Error from running `cmake ..`, which cannot find the dependent libraries.
+
+  A: If you haven't installed the libraries, install them. If you installed the
+  libraries in a folder that is outside of the system folder, e.g. /usr/local,
+  you need to export the following variables
+
+  ```shell
+  $ export CMAKE_INCLUDE_PATH=<path to your header file folder>
+  $ export CMAKE_LIBRARY_PATH=<path to your lib file folder>
+  ```
+
+- Q: Error from `make`, e.g. the linking phase
+
+  A: If your libraries are in folders other than the system default paths, you
+  need to export the following variables
+
+  ```shell
+  $ export LIBRARY_PATH=<path to your lib file folder>
+  $ export LD_LIBRARY_PATH=<path to your lib file folder>
+  ```
+
+- Q: Error from header files, e.g. 'cblas.h: no such file or directory'
+
+  A: You need to include the folder containing cblas.h in CPLUS_INCLUDE_PATH,
+  e.g.,
+
+  ```shell
+  $ export CPLUS_INCLUDE_PATH=/opt/OpenBLAS/include:$CPLUS_INCLUDE_PATH
+  ```
+
+- Q: While compiling SINGA, I get the error `SSE2 instruction set not enabled`
+
+  A: You can try the following command:
+
+  ```shell
+  $ make CFLAGS='-msse2' CXXFLAGS='-msse2'
+  ```
+
+- Q: I get `ImportError: cannot import name enum_type_wrapper` from
+  google.protobuf.internal when I try to import .py files.
+
+  A: You need to install the Python binding of protobuf, which can be
+  installed via
+
+  ```shell
+  $ sudo apt-get install protobuf
+  ```
+
+  or from source
+
+  ```shell
+  $ cd /PROTOBUF/SOURCE/FOLDER
+  $ cd python
+  $ python setup.py build
+  $ python setup.py install
+  ```
+
+- Q: When I build OpenBLAS from source, I am told that I need a Fortran
+  compiler.
+
+  A: You can compile OpenBLAS by
+
+  ```shell
+  $ make ONLY_CBLAS=1
+  ```
+
+  or install it using
+
+  ```shell
+  $ sudo apt-get install libopenblas-dev
+  ```
+
+- Q: When I build protocol buffer, it reports that `GLIBCXX_3.4.20` is not
+  found in `/usr/lib64/libstdc++.so.6`?
+
+  A: This means the linker found libstdc++.so.6 but that library belongs to an
+  older version of GCC than was used to compile and link the program. The
+  program depends on code defined in the newer libstdc++ that belongs to the
+  newer version of GCC, so the linker must be told how to find the newer
+  libstdc++ shared library. The simplest way to fix this is to find the correct
+  libstdc++ and export it to LD_LIBRARY_PATH. For example, if GLIBC++\_3.4.20 is
+  listed in the output of the following command,
+
+        $ strings /usr/local/lib64/libstdc++.so.6|grep GLIBC++
+
+  then you just set your environment variable as
+
+        $ export LD_LIBRARY_PATH=/usr/local/lib64:$LD_LIBRARY_PATH
+
+- Q: When I build glog, it reports that "src/logging_unittest.cc:83:20: error:
+  ‘gflags’ is not a namespace-name"
+
+  A: You may have installed gflags with a different namespace, such as
+  "google", so glog cannot find the 'gflags' namespace. Since gflags is not
+  required to build glog, you can change the configure.ac file to ignore
+  gflags.
+
+        1. cd to glog src directory
+        2. change line 125 of configure.ac  to "AC_CHECK_LIB(gflags, main, ac_cv_have_libgflags=0, ac_cv_have_libgflags=0)"
+        3. autoreconf
+
+  After this, you can build glog again.
+
+- Q: When using a virtual environment, every time I run pip install, it
+  reinstalls numpy. However, that numpy is not used when I `import numpy`.
+
+  A: It could be caused by the `PYTHONPATH`, which should be set to empty when
+  you are using a virtual environment, to avoid conflicts with the path of the
+  virtual environment.
+
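+  For example, in bash:
+
+  ```shell
+  $ export PYTHONPATH=
+  ```
+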
+- Q: When compiling PySINGA from source, there is a compilation error caused by
+  the missing <numpy/arrayobject.h>
+
+  A: Please install numpy and export the path of the numpy header files as
+
+        $ export CPLUS_INCLUDE_PATH=`python -c "import numpy; print(numpy.get_include())"`:$CPLUS_INCLUDE_PATH
+
+- Q: When I run SINGA on macOS, I get the error "Fatal Python error:
+  PyThreadState_Get: no current thread Abort trap: 6"
+
+  A: This error typically happens when you have multiple versions of Python on
+  your system and you installed SINGA via pip (this problem is resolved for
+  installation via conda), e.g., the one that comes with the OS and the one
+  installed by Homebrew. The Python linked by PySINGA must be the same as the Python
+  interpreter. You can check your interpreter by `which python` and check the
+  Python linked by PySINGA via `otool -L <path to _singa_wrap.so>`. To fix this
+  error, compile SINGA with the correct version of Python. In particular, if you
+  build PySINGA from source, you need to specify the paths when invoking
+  [cmake](http://stackoverflow.com/questions/15291500/i-have-2-versions-of-python-installed-but-cmake-is-using-older-version-how-do)
+
+        $ cmake -DPYTHON_LIBRARY=`python-config --prefix`/lib/libpython2.7.dylib -DPYTHON_INCLUDE_DIR=`python-config --prefix`/include/python2.7/ ..
+
+  If you installed PySINGA from binary packages, e.g. debian or wheel, then you
+  need to change the Python interpreter, e.g., reset the \$PATH to put the
+  correct path of Python at the front position.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/contribute-code.md b/docs-site/website/versioned_docs/version-3.1.0/contribute-code.md
new file mode 100644
index 0000000..e2602b1
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/contribute-code.md
@@ -0,0 +1,126 @@
+---
+id: version-3.1.0-contribute-code
+title: How to Contribute Code
+original_id: contribute-code
+---
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License. -->
+
+## Coding Style
+
+The SINGA codebase follows the Google Style for both
+[CPP](http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml) and
+[Python](http://google.github.io/styleguide/pyguide.html) code.
+
+A simple way to enforce the Google coding styles is to use the linting and
+formatting tools in the Visual Studio Code editor:
+
+- [C/C++ extension](https://marketplace.visualstudio.com/items?itemName=ms-vscode.cpptools)
+- [Python extension](https://marketplace.visualstudio.com/items?itemName=ms-python.python)
+- [cpplint extension](https://marketplace.visualstudio.com/items?itemName=mine.cpplint)
+- [Clang-Format](https://marketplace.visualstudio.com/items?itemName=xaver.clang-format)
+
+Once the extensions are installed, edit the `settings.json` file.
+
+```json
+{
+  "[cpp]": {
+    "editor.defaultFormatter": "xaver.clang-format"
+  },
+  "cpplint.cpplintPath": "path/to/cpplint",
+
+  "editor.formatOnSave": true,
+  "python.formatting.provider": "yapf",
+  "python.linting.enabled": true,
+  "python.linting.lintOnSave": true,
+  "clang-format.language.cpp.style": "google",
+  "python.formatting.yapfArgs": ["--style", "{based_on_style: google}"]
+}
+```
+
+Depending on your platform, the user settings file is located here:
+
+1. Windows %APPDATA%\Code\User\settings.json
+2. macOS "\$HOME/Library/Application Support/Code/User/settings.json"
+3. Linux "\$HOME/.config/Code/User/settings.json"
+
+Configurations are specified in the corresponding config files, and these tools
+automatically look for the configuration files, e.g. `.pylintrc`, in the root
+of the project.
+
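+As an illustration, a minimal `.pylintrc` (hypothetical content, not the
+project's actual configuration) could look like this:
+
+```
+[FORMAT]
+max-line-length = 80
+```
+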
+#### Tool Installation
+
+Ideally, all contributors use the same versions of the code formatting tools
+(clang-format 9.0.0 and yapf 0.29.0), so that the code formatting in different
+PRs is identical, which avoids GitHub pull request conflicts.
+
+First, install LLVM 9.0, which provides clang-format version 9.0.0. The
+download page of LLVM is:
+
+- [LLVM](http://releases.llvm.org/download.html#9.0.0)
+
+  - On Ubuntu
+
+    ```sh
+    sudo apt-get install clang-format-9
+    ```
+
+  - On Windows. Download the pre-built package and install
+
+Second, install cpplint, pylint and yapf
+
+- Ubuntu or OSX:
+
+  ```
+  $ sudo pip install cpplint
+  $ which cpplint
+  /path/to/cpplint
+
+  $ pip install yapf==0.29.0
+  $ pip install pylint
+  ```
+
+- Windows: Install Anaconda for package management.
+
+  ```
+  $ pip install cpplint
+  $ where cpplint
+  C:/path/to/cpplint.exe
+
+  $ pip install yapf==0.29.0
+  $ pip install pylint
+  ```
+
+#### Usage
+
+- After the configuration, linting is automatically applied when editing a
+  source code file. Errors and warnings are listed in the Visual Studio Code
+  `PROBLEMS` panel.
+- Code formatting can be done by bringing up the Command Palette
+  (`Shift+Ctrl+P` on Windows or `Shift+Command+P` on macOS) and typing
+  `Format Document`.
+
+#### Submission
+
+You need to fix the format errors before submitting the pull requests.
+
+## Developing Environment
+
+Visual Studio Code is recommended as the editor. Extensions like Python, C/C++,
+Code Spell Checker, autoDocstring, vim, Remote Development could be installed. A
+reference configuration (i.e., `settings.json`) of these extensions is
+[here](https://gist.github.com/nudles/3d23cfb6ffb30ca7636c45fe60278c55).
+
+If you update the CPP code, you need to recompile SINGA
+[from source](./build.md). It is recommended to use the native building tools in
+the `*-devel` Docker images or `conda build`.
+
+If you only update the Python code, you can install SINGA once, and then copy
+the updated Python files to replace those in the Python installation folder,
+
+```shell
+cp python/singa/xx.py  <path to conda>/lib/python3.7/site-packages/singa/
+```
+
+## Workflow
+
+Please refer to the [git workflow page](./git-workflow.md).
diff --git a/docs-site/website/versioned_docs/version-3.1.0/dist-train.md b/docs-site/website/versioned_docs/version-3.1.0/dist-train.md
new file mode 100644
index 0000000..3b3835a
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/dist-train.md
@@ -0,0 +1,452 @@
+---
+id: version-3.1.0-dist-train
+title: Distributed Training
+original_id: dist-train
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA supports data parallel training across multiple GPUs (on a single node or
+across different nodes). The following figure illustrates the data parallel
+training:
+
+![MPI.png](assets/MPI.png)
+
+In distributed training, each process (called a worker) runs a training script
+on a single GPU. Each process has an individual communication rank. The
+training data is partitioned among the workers and the model is replicated on
+every worker. In each iteration, each worker reads a mini-batch of data (e.g.,
+256 images) from its partition and runs the BackPropagation algorithm to
+compute the gradients of the weights, which are averaged via all-reduce
+(provided by [NCCL](https://developer.nvidia.com/nccl)) for the weight update
+following the stochastic gradient descent (SGD) algorithm.
+
+The all-reduce operation by NCCL can be used to reduce and synchronize the
+gradients from different GPUs. Let's consider the training with 4 GPUs as shown
+below. Once the gradients from the 4 GPUs are calculated, all-reduce will return
+the sum of the gradients over the GPUs and make it available on every GPU. Then
+the averaged gradients can be easily calculated.
+
+![AllReduce.png](assets/AllReduce.png)
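+
+The averaging step itself is simple element-wise arithmetic. Here is a minimal
+NumPy sketch of the sum-then-average idea (independent of NCCL; the gradient
+values are made up for illustration):
+
+```python
+import numpy as np
+
+# hypothetical gradients of the same parameter on 4 GPUs
+grads = [np.array([0.2, -0.4]), np.array([0.1, -0.2]),
+         np.array([0.3, -0.6]), np.array([0.2, -0.4])]
+
+# all-reduce makes the element-wise sum available on every GPU ...
+total = np.sum(grads, axis=0)
+
+# ... and each worker divides by the number of workers to get the average
+avg = total / len(grads)
+print(avg)  # [ 0.2 -0.4]
+```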
+
+## Usage
+
+SINGA implements a module called `DistOpt` (a subclass of `Opt`) for
+distributed training. It wraps a normal SGD optimizer and calls `Communicator`
+for gradient synchronization. The following example illustrates the usage of
+`DistOpt` for training a CNN model over the MNIST dataset. The source code is
+available [here](https://github.com/apache/singa/blob/master/examples/cnn/),
+and there is a [Colab notebook]() for it.
+
+### Example Code
+
+1. Define the neural network model:
+
+```python
+class CNN(model.Model):
+
+    def __init__(self, num_classes=10, num_channels=1):
+        super(CNN, self).__init__()
+        self.conv1 = layer.Conv2d(num_channels, 20, 5, padding=0, activation="RELU")
+        self.conv2 = layer.Conv2d(20, 50, 5, padding=0, activation="RELU")
+        self.linear1 = layer.Linear(500)
+        self.linear2 = layer.Linear(num_classes)
+        self.pooling1 = layer.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = layer.MaxPool2d(2, 2, padding=0)
+        self.relu = layer.ReLU()
+        self.flatten = layer.Flatten()
+        self.softmax_cross_entropy = layer.SoftMaxCrossEntropy()
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = self.pooling2(y)
+        y = self.flatten(y)
+        y = self.linear1(y)
+        y = self.relu(y)
+        y = self.linear2(y)
+        return y
+
+    def train_one_batch(self, x, y, dist_option='fp32', spars=0):
+        out = self.forward(x)
+        loss = self.softmax_cross_entropy(out, y)
+
+        # Allow different options for distributed training
+        # See the section "Optimizations for Distributed Training"
+        if dist_option == 'fp32':
+            self.optimizer(loss)
+        elif dist_option == 'fp16':
+            self.optimizer.backward_and_update_half(loss)
+        elif dist_option == 'partialUpdate':
+            self.optimizer.backward_and_partial_update(loss)
+        elif dist_option == 'sparseTopK':
+            self.optimizer.backward_and_sparse_update(loss,
+                                                      topK=True,
+                                                      spars=spars)
+        elif dist_option == 'sparseThreshold':
+            self.optimizer.backward_and_sparse_update(loss,
+                                                      topK=False,
+                                                      spars=spars)
+        return out, loss
+
+# create model
+model = CNN()
+```
+
+2. Create the `DistOpt` instance and attach it to the created model:
+
+```python
+sgd = opt.SGD(lr=0.005, momentum=0.9, weight_decay=1e-5)
+sgd = opt.DistOpt(sgd)
+model.set_optimizer(sgd)
+dev = device.create_cuda_gpu_on(sgd.local_rank)
+```
+
+Here are some explanations of the variables used in the code:
+
+(i) `dev`
+
+`dev` represents the `Device` instance on which the data is loaded and the CNN
+model is run.
+
+(ii) `local_rank`
+
+The local rank represents the GPU number that the current process is using
+within the same node. For example, if you are using a node with 2 GPUs,
+`local_rank=0` means that this process is using the first GPU, while
+`local_rank=1` means that it is using the second GPU. Using MPI or
+multiprocessing, you can run the same training script; the processes differ
+only in the value of `local_rank`.
+
+(iii) `global_rank`
+
+The global rank numbers all the processes across all the nodes you are using.
+For example, if you have 3 nodes and each node has two GPUs, `global_rank=0`
+means the process using the 1st GPU of the 1st node, `global_rank=2` means the
+process using the 1st GPU of the 2nd node, and `global_rank=4` means the
+process using the 1st GPU of the 3rd node.
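+
+A minimal sketch of the relation between these ranks, assuming a uniform
+number of GPUs per node (this convention is an assumption for illustration,
+not a SINGA API):
+
+```python
+gpus_per_node = 2  # as in the example above
+
+def global_rank(node_id, local_rank):
+    # the global rank enumerates all processes across all nodes
+    return node_id * gpus_per_node + local_rank
+
+assert global_rank(0, 0) == 0  # 1st GPU of the 1st node
+assert global_rank(1, 0) == 2  # 1st GPU of the 2nd node
+assert global_rank(2, 0) == 4  # 1st GPU of the 3rd node
+```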
+
+3. Load and partition the training/validation data:
+
+```python
+def data_partition(dataset_x, dataset_y, global_rank, world_size):
+    data_per_rank = dataset_x.shape[0] // world_size
+    idx_start = global_rank * data_per_rank
+    idx_end = (global_rank + 1) * data_per_rank
+    return dataset_x[idx_start:idx_end], dataset_y[idx_start:idx_end]
+
+train_x, train_y, test_x, test_y = load_dataset()
+train_x, train_y = data_partition(train_x, train_y,
+                                  sgd.global_rank, sgd.world_size)
+test_x, test_y = data_partition(test_x, test_y,
+                                sgd.global_rank, sgd.world_size)
+```
+
+A partition of the dataset is returned for this `dev`.
+
+Here, `world_size` represents the total number of processes in all the nodes you
+are using for distributed training.
+
+4. Initialize and synchronize the model parameters among all workers:
+
+```python
+# Synchronize the initial parameters
+tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
+ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32)
+model.compile([tx], is_train=True, use_graph=graph, sequential=True)
+...
+# Use the same random seed for different ranks
+seed = 0
+dev.SetRandSeed(seed)
+np.random.seed(seed)
+```
+
+5. Run BackPropagation and distributed SGD:
+
+```python
+for epoch in range(max_epoch):
+    for b in range(num_train_batch):
+        x = train_x[idx[b * batch_size: (b + 1) * batch_size]]
+        y = train_y[idx[b * batch_size: (b + 1) * batch_size]]
+        tx.copy_from_numpy(x)
+        ty.copy_from_numpy(y)
+        # Train the model
+        out, loss = model(tx, ty)
+```
+
+### Execution Instruction
+
+There are two ways to launch the training: MPI or Python multiprocessing.
+
+#### Python multiprocessing
+
+It works on a single node with multiple GPUs, where each GPU is one worker.
+
+1. Put all of the above training code in a function
+
+```python
+def train_mnist_cnn(nccl_id=None, local_rank=None, world_size=None):
+    ...
+```
+
+2. Create `mnist_multiprocess.py`
+
+```python
+if __name__ == '__main__':
+    # Generate a NCCL ID to be used for collective communication
+    nccl_id = singa.NcclIdHolder()
+
+    # Define the number of GPUs to be used in the training process
+    world_size = int(sys.argv[1])
+
+    # Define and launch the multi-processing
+    import multiprocessing
+    process = []
+    for local_rank in range(0, world_size):
+        process.append(multiprocessing.Process(target=train_mnist_cnn,
+                       args=(nccl_id, local_rank, world_size)))
+
+    for p in process:
+        p.start()
+```
+
+Here are some explanations concerning the variables created above:
+
+(i) `nccl_id`
+
+Note that we need to generate a NCCL ID here to be used for collective
+communication, and then pass it to all the processes. The NCCL ID is like a
+ticket, where only the processes with this ID can join the all-reduce operation.
+(Later, if we use MPI, passing the NCCL ID is not necessary, because the ID is
+broadcast by MPI in our code automatically.)
+
+(ii) `world_size`
+
+`world_size` is the number of GPUs you would like to use for training.
+
+(iii) `local_rank`
+
+`local_rank` determines the local rank in the distributed training and which
+GPU is used by the process. In the code above, we use a for loop to run the
+training function, where the argument `local_rank` iterates from 0 to
+`world_size - 1`. In this case, different processes use different GPUs for
+training.
+
+The arguments for creating the `DistOpt` instance should be updated as
+follows:
+
+```python
+sgd = opt.DistOpt(sgd, nccl_id=nccl_id, local_rank=local_rank, world_size=world_size)
+```
+
+3. Run `mnist_multiprocess.py`
+
+```sh
+python mnist_multiprocess.py 2
+```
+
+It results in a speed-up compared to single-GPU training.
+
+```
+Starting Epoch 0:
+Training loss = 408.909790, training accuracy = 0.880475
+Evaluation accuracy = 0.956430
+Starting Epoch 1:
+Training loss = 102.396790, training accuracy = 0.967415
+Evaluation accuracy = 0.977564
+Starting Epoch 2:
+Training loss = 69.217010, training accuracy = 0.977915
+Evaluation accuracy = 0.981370
+Starting Epoch 3:
+Training loss = 54.248390, training accuracy = 0.982823
+Evaluation accuracy = 0.984075
+Starting Epoch 4:
+Training loss = 45.213406, training accuracy = 0.985560
+Evaluation accuracy = 0.985276
+Starting Epoch 5:
+Training loss = 38.868435, training accuracy = 0.987764
+Evaluation accuracy = 0.986278
+Starting Epoch 6:
+Training loss = 34.078186, training accuracy = 0.989149
+Evaluation accuracy = 0.987881
+Starting Epoch 7:
+Training loss = 30.138697, training accuracy = 0.990451
+Evaluation accuracy = 0.988181
+Starting Epoch 8:
+Training loss = 26.854443, training accuracy = 0.991520
+Evaluation accuracy = 0.988682
+Starting Epoch 9:
+Training loss = 24.039650, training accuracy = 0.992405
+Evaluation accuracy = 0.989083
+```
+
+#### MPI
+
+It works for both single node and multiple nodes as long as there are multiple
+GPUs.
+
+1. Create `mnist_dist.py`
+
+```python
+if __name__ == '__main__':
+    train_mnist_cnn()
+```
+
+2. Generate a hostfile for MPI, e.g., the hostfile below uses 2 processes
+   (i.e., 2 GPUs) on a single node; a multi-node example follows the block
+
+```txt
+localhost:2
+```
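+
+For multi-node training, each node is listed with its process count. For
+example, a hypothetical hostfile for two nodes with 2 GPUs each (the host
+names are placeholders):
+
+```txt
+node1:2
+node2:2
+```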
+
+3. Launch the training via `mpiexec`
+
+```sh
+mpiexec --hostfile host_file python mnist_dist.py
+```
+
+It should also result in a speed-up compared to single-GPU training.
+
+```
+Starting Epoch 0:
+Training loss = 383.969543, training accuracy = 0.886402
+Evaluation accuracy = 0.954327
+Starting Epoch 1:
+Training loss = 97.531479, training accuracy = 0.969451
+Evaluation accuracy = 0.977163
+Starting Epoch 2:
+Training loss = 67.166870, training accuracy = 0.978516
+Evaluation accuracy = 0.980769
+Starting Epoch 3:
+Training loss = 53.369656, training accuracy = 0.983040
+Evaluation accuracy = 0.983974
+Starting Epoch 4:
+Training loss = 45.100403, training accuracy = 0.985777
+Evaluation accuracy = 0.986078
+Starting Epoch 5:
+Training loss = 39.330826, training accuracy = 0.987447
+Evaluation accuracy = 0.987179
+Starting Epoch 6:
+Training loss = 34.655270, training accuracy = 0.988799
+Evaluation accuracy = 0.987780
+Starting Epoch 7:
+Training loss = 30.749735, training accuracy = 0.989984
+Evaluation accuracy = 0.988281
+Starting Epoch 8:
+Training loss = 27.422146, training accuracy = 0.991319
+Evaluation accuracy = 0.988582
+Starting Epoch 9:
+Training loss = 24.548153, training accuracy = 0.992171
+Evaluation accuracy = 0.988682
+```
+
+## Optimizations for Distributed Training
+
+SINGA provides multiple optimization strategies for distributed training to
+reduce the communication cost. Refer to the API of `DistOpt` for the
+configuration of each strategy.
+
+When we use `model.Model` to build a model, we need to put the options for
+distributed training in the `train_one_batch` method. Please refer to the
+example code at the top of this page. We can simply copy that code and reuse
+it in other models.
+
+With the defined options, we can pass the arguments `dist_option` and `spars`
+when we start the training with `model(tx, ty, dist_option, spars)`.
+
+### No Optimizations
+
+```python
+out, loss = model(tx, ty)
+```
+
+`loss` is the output tensor from the loss function, e.g., cross-entropy for
+classification tasks.
+
+### Half-precision Gradients
+
+```python
+out, loss = model(tx, ty, dist_option = 'fp16')
+```
+
+It converts each gradient value to 16-bit representation (i.e., half-precision)
+before calling all-reduce.
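+
+A two-line NumPy illustration of the saving (not SINGA code): casting to
+`float16` halves the bytes to be transferred, at the cost of precision.
+
+```python
+import numpy as np
+
+grad = np.random.randn(1024).astype(np.float32)
+half = grad.astype(np.float16)
+print(grad.nbytes, half.nbytes)  # 4096 2048
+```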
+
+### Partial Synchronization
+
+```python
+out, loss = model(tx, ty, dist_option = 'partialUpdate')
+```
+
+In each iteration, every rank does a local SGD update. Then, only a chunk of
+the parameters is averaged for synchronization, which saves communication
+cost. The chunk size is configured when creating the `DistOpt` instance.
+
+### Gradient Sparsification
+
+It applies sparsification schemes to select a subset of gradients for
+all-reduce. There are two schemes:
+
+- The top-K largest elements are selected. `spars` is the portion (0 - 1) of
+  the total elements selected.
+
+```python
+out, loss = model(tx, ty, dist_option = 'sparseTopK', spars = spars)
+```
+
+- All gradients whose absolute values are larger than the predefined threshold
+  `spars` are selected.
+
+```python
+out, loss = model(tx, ty, dist_option = 'sparseThreshold', spars = spars)
+```
+
+The hyper-parameters are configured when creating the `DistOpt` instance.
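+
+A NumPy sketch of the two selection schemes (an illustration of the idea only,
+not SINGA's implementation):
+
+```python
+import numpy as np
+
+grad = np.array([0.9, -0.1, 0.05, -0.8, 0.3])
+
+# Top-K: keep the `spars` portion of the largest-magnitude elements
+spars = 0.4
+k = int(spars * grad.size)                         # k = 2
+topk_idx = np.argsort(np.abs(grad))[-k:]           # indices of 0.9 and -0.8
+
+# Threshold: keep elements whose magnitude exceeds `spars`
+threshold_idx = np.where(np.abs(grad) > spars)[0]  # indices of 0.9 and -0.8
+```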
+
+## Implementation
+
+This section is mainly for developers who want to know how the code in the
+distributed module is implemented.
+
+### C interface for NCCL communicator
+
+First, the communication layer is written in C++ in
+[communicator.cc](https://github.com/apache/singa/blob/master/src/io/communicator.cc).
+It uses the NCCL library for collective communication.
+
+There are two constructors for the communicator, one for MPI and another for
+multiprocess.
+
+(i) Constructor using MPI
+
+The constructor first obtains the global rank and the world size, and
+calculates the local rank. Then, rank 0 generates an NCCL ID and broadcasts it
+to every rank. After that, it calls the setup function to initialize the NCCL
+communicator, CUDA streams, and buffers.
+
+(ii) Constructor using Python multiprocess
+
+The constructor first obtains the rank, the world size, and the NCCL ID from
+the input arguments. After that, it calls the setup function to initialize the
+NCCL communicator, CUDA streams, and buffers.
+
+After the initialization, the communicator provides the all-reduce
+functionality to synchronize the model parameters or gradients. For instance,
+synch takes an input tensor and performs all-reduce through the NCCL routine.
+After we call synch, it is necessary to call the wait function to wait for the
+all-reduce operation to complete.
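+
+The calling pattern can be sketched as follows (schematic pseudocode based on
+the description above, not the exact signatures):
+
+```python
+# schematic: the all-reduce is asynchronous, so an explicit wait is required
+communicator.synch(tensor)  # enqueue the all-reduce of `tensor`
+communicator.wait()         # block until the all-reduce has completed
+```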
+
+### Python interface for DistOpt
+
+Then, the Python interface provides a
+[DistOpt](https://github.com/apache/singa/blob/master/python/singa/opt.py)
+class to wrap an
+[optimizer](https://github.com/apache/singa/blob/master/python/singa/opt.py)
+object and perform distributed training based on MPI or multiprocessing.
+During initialization, it creates an NCCL communicator object (from the C
+interface mentioned in the subsection above). This communicator object is then
+used in every all-reduce operation in DistOpt.
+
+In MPI or multiprocessing, each process has an individual rank, which
+indicates which GPU the process is using. The training data is partitioned so
+that each process can evaluate the sub-gradient on its partition of the
+training data. Once the sub-gradient is calculated on each process, the
+overall stochastic gradient is obtained by all-reducing the sub-gradients
+evaluated by all processes.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/download.md b/docs-site/website/versioned_docs/version-3.1.0/download.md
new file mode 100644
index 0000000..a607672
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/download.md
@@ -0,0 +1,208 @@
+---
+id: version-3.1.0-download-singa
+title: Download SINGA
+original_id: download-singa
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## Verify
+
+To verify the downloaded tar.gz file, download the
+[KEYS](https://www.apache.org/dist/singa/KEYS) and ASC files and then execute
+the following commands
+
+```shell
+% gpg --import KEYS
+% gpg --verify downloaded_file.asc downloaded_file
+```
+
+You can also check the SHA512 or MD5 values to verify that the download is
+complete.
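+
+For example, assuming `sha512sum` is available (`shasum -a 512` on macOS), the
+SHA512 checksum can be verified like this:
+
+```shell
+% sha512sum -c apache-singa-3.1.0.tar.gz.sha512
+apache-singa-3.1.0.tar.gz: OK
+```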
+
+## V3.1.0 (30 October 2020):
+
+- [Apache SINGA 3.1.0](http://www.apache.org/dyn/closer.cgi/singa/3.1.0/apache-singa-3.1.0.tar.gz)
+  [\[SHA512\]](https://www.apache.org/dist/singa/3.1.0/apache-singa-3.1.0.tar.gz.sha512)
+  [\[ASC\]](https://www.apache.org/dist/singa/3.1.0/apache-singa-3.1.0.tar.gz.asc)
+- [Release Notes 3.1.0](http://singa.apache.org/docs/releases/RELEASE_NOTES_3.1.0)
+- Major changes:
+  - Update Tensor core:
+    - Support tensor transformation (reshape, transpose) for tensors up to 6
+      dimensions.
+    - Implement traverse_unary_transform in Cuda backend, which is similar to
+      CPP backend one.
+  - Add new tensor operators into the autograd module.
+  - Reconstruct sonnx to
+    - Support creating operators from both layer and autograd.
+    - Re-write SingaRep to provide a more powerful intermediate representation
+      of SINGA.
+    - Add a SONNXModel which implements from Model to provide uniform API and
+      features.
+  * Replace the Travis CI with Github workflow. Add quality and coverage
+    management.
+  * Add compiling and packaging scripts to create wheel packages for
+    distribution.
+  * Fix bugs
+    - Fix IMDB LSTM model example training script.
+    - Fix Tensor operation Mult on Broadcasting use cases.
+    - Gaussian function on Tensor now can run on Tensor with odd size.
+    - Updated a testing helper function gradients() in autograd to lookup param
+      gradient by param python object id for testing purpose.
+
+## V3.0.0 (18 April 2020):
+
+- [Apache SINGA 3.0.0](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/singa/3.0.0/apache-singa-3.0.0.tar.gz.asc)
+- [Release Notes 3.0.0](http://singa.apache.org/docs/releases/RELEASE_NOTES_3.0.0)
+- New features and major changes,
+  - Enhanced ONNX. Multiple ONNX models have been tested in SINGA.
+  - Distributed training with MPI and NCCL Communication optimization through
+    gradient sparsification and compression, and chunk transmission.
+  - Computational graph construction and optimization for speed and memory using
+    the graph.
+  - New documentation website (singa.apache.org) and API reference website
+    (apache-singa.rtfd.io).
+  - CI for code quality check.
+  - Replace MKLDNN with DNNL
+  - Update tensor APIs to support broadcasting operations.
+  - New autograd operators to support ONNX models.
+
+## Incubating v2.0.0 (20 April 2019):
+
+- [Apache SINGA 2.0.0 (incubating)](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/2.0.0/apache-singa-incubating-2.0.0.tar.gz.asc)
+- [Release Notes 2.0.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_2.0.0.html)
+- New features and major updates,
+  - Enhance autograd (for Convolution networks and recurrent networks)
+  - Support ONNX
+  - Improve the CPP operations via Intel MKL DNN lib
+  - Implement tensor broadcasting
+  - Move Docker images under Apache user name
+  - Update dependent lib versions in conda-build config
+
+## Incubating v1.2.0 (6 June 2018):
+
+- [Apache SINGA 1.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz)
+  [\[SHA512\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.sha512)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.2.0/apache-singa-incubating-1.2.0.tar.gz.asc)
+- [Release Notes 1.2.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.2.0.html)
+- New features and major updates,
+  - Implement autograd (currently support MLP model)
+  - Upgrade PySinga to support Python 3
+  - Improve the Tensor class with the stride field
+  - Upgrade cuDNN from V5 to V7
+  - Add VGG, Inception V4, ResNet, and DenseNet for ImageNet classification
+  - Create alias for conda packages
+  - Complete documentation in Chinese
+  - Add instructions for running Singa on Windows
+  - Update the compilation, CI
+  - Fix some bugs
+
+## Incubating v1.1.0 (12 February 2017):
+
+- [Apache SINGA 1.1.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.1.0/apache-singa-incubating-1.1.0.tar.gz.asc)
+- [Release Notes 1.1.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.1.0.html)
+- New features and major updates,
+  - Create Docker images (CPU and GPU versions)
+  - Create Amazon AMI for SINGA (CPU version)
+  - Integrate with Jenkins for automatically generating Wheel and Debian
+    packages (for installation), and updating the website.
+  - Enhance the FeedForwardNet, e.g., multiple inputs and verbose mode for
+    debugging
+  - Add Concat and Slice layers
+  - Extend CrossEntropyLoss to accept instance with multiple labels
+  - Add image_tool.py with image augmentation methods
+  - Support model loading and saving via the Snapshot API
+  - Compile SINGA source on Windows
+  - Compile mandatory dependent libraries together with SINGA code
+  - Enable Java binding (basic) for SINGA
+  - Add version ID in checkpointing files
+  - Add Rafiki toolkit for providing RESTFul APIs
+  - Add examples pretrained from Caffe, including GoogleNet
+
+## Incubating v1.0.0 (8 September 2016):
+
+- [Apache SINGA 1.0.0 (incubating)](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/1.0.0/apache-singa-incubating-1.0.0.tar.gz.asc)
+- [Release Notes 1.0.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_1.0.0.html)
+- New features and major updates,
+  - Tensor abstraction for supporting more machine learning models.
+  - Device abstraction for running on different hardware devices, including CPU,
+    (Nvidia/AMD) GPU and FPGA (to be tested in later versions).
+  - Replace GNU autotool with cmake for compilation.
+  - Support Mac OS
+  - Improve Python binding, including installation and programming
+  - More deep learning models, including VGG and ResNet
+  - More IO classes for reading/writing files and encoding/decoding data
+  - New network communication components directly based on Socket.
+  - Cudnn V5 with Dropout and RNN layers.
+  - Replace website building tool from maven to Sphinx
+  - Integrate Travis-CI
+
+## Incubating v0.3.0 (20 April 2016):
+
+- [Apache SINGA 0.3.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.3.0/apache-singa-incubating-0.3.0.tar.gz.asc)
+- [Release Notes 0.3.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.3.0.html)
+- New features and major updates,
+  - Training on GPU cluster enables training of deep learning models over a GPU
+    cluster.
+  - Python wrapper improvement makes it easy to configure the job, including
+    neural net and SGD algorithm.
+  - New SGD updaters are added, including Adam, AdaDelta and AdaMax.
+  - Installation has fewer dependent libraries for single node training.
+  - Heterogeneous training with CPU and GPU.
+  - Support cuDNN V4.
+  - Data prefetching.
+  - Fix some bugs.
+
+## Incubating v0.2.0 (14 January 2016):
+
+- [Apache SINGA 0.2.0 (incubating)](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/0.2.0/apache-singa-incubating-0.2.0.tar.gz.asc)
+- [Release Notes 0.2.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.2.0.html)
+- New features and major updates,
+  - Training on GPU enables training of complex models on a single node with
+    multiple GPU cards.
+  - Hybrid neural net partitioning supports data and model parallelism at the
+    same time.
+  - Python wrapper makes it easy to configure the job, including neural net and
+    SGD algorithm.
+  - RNN model and BPTT algorithm are implemented to support applications based
+    on RNN models, e.g., GRU.
+  - Cloud software integration includes Mesos, Docker and HDFS.
+  - Visualization of neural net structure and layer information, which is
+    helpful for debugging.
+  - Linear algebra functions and random functions against Blobs and raw data
+    pointers.
+  - New layers, including SoftmaxLayer, ArgSortLayer, DummyLayer, RNN layers and
+    cuDNN layers.
+  - Update Layer class to carry multiple data/grad Blobs.
+  - Extract features and test performance for new data by loading previously
+    trained model parameters.
+  - Add Store class for IO operations.
+
+## Incubating v0.1.0 (8 October 2015):
+
+- [Apache SINGA 0.1.0 (incubating)](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz)
+  [\[MD5\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.md5)
+  [\[ASC\]](https://archive.apache.org/dist/incubator/singa/apache-singa-incubating-0.1.0.tar.gz.asc)
+- [Amazon EC2 image](https://console.aws.amazon.com/ec2/v2/home?region=ap-southeast-1#LaunchInstanceWizard:ami=ami-b41001e6)
+- [Release Notes 0.1.0 (incubating)](http://singa.apache.org/docs/releases/RELEASE_NOTES_0.1.0.html)
+- Major features include,
+  - Installation using GNU build utility
+  - Scripts for job management with zookeeper
+  - Programming model based on NeuralNet and Layer abstractions.
+  - System architecture based on Worker, Server and Stub.
+  - Training models from three different model categories, namely, feed-forward
+    models, energy models and RNN models.
+  - Synchronous and asynchronous distributed training frameworks using CPU
+  - Checkpoint and restore
+  - Unit test using gtest
diff --git a/docs-site/website/versioned_docs/version-3.1.0/examples.md b/docs-site/website/versioned_docs/version-3.1.0/examples.md
new file mode 100644
index 0000000..5522137
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/examples.md
@@ -0,0 +1,69 @@
+---
+id: version-3.1.0-examples
+title: Examples
+original_id: examples
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+This page lists some example deep learning tasks using SINGA. The source code
+is maintained inside the SINGA repo on
+[Github](https://github.com/apache/singa/tree/master/examples). Examples that
+run on CPU or a single GPU using the SINGA Python APIs are also available on
+[Google Colab](https://colab.research.google.com/), so you can run them
+directly on Google Cloud without setting up the environment locally. The link
+to each example is given below.
+
+## Image Classification
+
+| Model       | Dataset                           | Links                                                                                                   |
+| ----------- | --------------------------------- | ------------------------------------------------------------------------------------------------------- |
+| Simple CNN  | MNIST, CIFAR10, CIFAR100          | [Colab](https://colab.research.google.com/drive/1fbGUs1AsoX6bU5F745RwQpohP4bHTktq)                      |
+| AlexNet     | ImageNet                          | [Cpp]()                                                                                                 |
+| VGG         | ImageNet                          | [Cpp](), [Python](), [Colab](https://colab.research.google.com/drive/14kxgRKtbjPCKKsDJVNi3AvTev81Gp_Ds) |
+| XceptionNet | MNIST, CIFAR10, CIFAR100          | [Python]()                                                                                              |
+| ResNet      | MNIST, CIFAR10, CIFAR100          | [Python](), [Colab](https://colab.research.google.com/drive/1u1RYefSsVbiP4I-5wiBKHjsT9L0FxLm9)          |
+| MobileNet   | ImageNet                          | [Colab](https://colab.research.google.com/drive/1HsixqJMIpKyEPhkbB8jy7NwNEFEAUWAf)                      |
+
+## Object Detection
+
+| Model       | Dataset    | Links                                                                              |
+| ----------- | ---------- | ---------------------------------------------------------------------------------- |
+| Tiny YOLOv2 | Pascal VOC | [Colab](https://colab.research.google.com/drive/11V4I6cRjIJNUv5ZGsEGwqHuoQEie6b1T) |
+
+## Face and Emotion Recognition
+
+| Model           | Dataset                                                                                                                                                | Links                                                                              |
+| --------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | ---------------------------------------------------------------------------------- |
+| ArcFace         | Refined MS-Celeb-1M                                                                                                                                    | [Colab](https://colab.research.google.com/drive/1qanaqUKGIDtifdzEzJOHjEj4kYzA9uJC) |
+| Emotion FerPlus | [Facial Expression Recognition Challenge](https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data) | [Colab](https://colab.research.google.com/drive/1XHtBQGRhe58PDi4LGYJzYueWBeWbO23r) |
+
+## Image Generation
+
+| Model | Dataset | Links                                                                              |
+| ----- | ------- | ---------------------------------------------------------------------------------- |
+| GAN   | MNIST   | [Colab](https://colab.research.google.com/drive/1f86MNDW47DJqHoIqWD1tOxcyx2MWys8L) |
+| LSGAN | MNIST   | [Colab](https://colab.research.google.com/drive/1C6jNRf28vnFOI9JVM4lpkJPqxsnhxdol) |
+
+## Machine Comprehension
+
+| Model      | Dataset                                                                   | Links                                                                              |
+| ---------- | ------------------------------------------------------------------------- | ---------------------------------------------------------------------------------- |
+| Bert-Squad | [SQuAD v1.1](https://rajpurkar.github.io/SQuAD-explorer/explore/1.1/dev/) | [Colab](https://colab.research.google.com/drive/1kud-lUPjS_u-TkDAzihBTw0Vqr0FjCE-) |
+
+## Text Classification
+
+| Model       | Dataset | Links      |
+| ----------- | ------- | ---------- |
+| Simple LSTM | IMDB    | [Python]() |
+
+## Text Ranking
+
+| Model  | Dataset     | Links      |
+| ------ | ----------- | ---------- |
+| BiLSTM | InsuranceQA | [Python]() |
+
+## Misc.
+
+- Restricted Boltzmann Machine over the MNIST dataset, [source](),
+  [Colab](https://colab.research.google.com/drive/19996noGu9JyHHkVmp4edBGu7PJSRQKsd).
diff --git a/docs-site/website/versioned_docs/version-3.1.0/graph.md b/docs-site/website/versioned_docs/version-3.1.0/graph.md
new file mode 100644
index 0000000..0cb7ef2
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/graph.md
@@ -0,0 +1,532 @@
+---
+id: version-3.1.0-graph
+title: Model
+original_id: graph
+---
+
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License. -->
+
+The forward and backward propagation in a neural network can be represented
+using a set of operations such as convolution and pooling. Each operation takes
+some input [tensors](./tensor) and applies an [operator](./autograd) to generate
+output tensors. By representing each operator as a node and each tensor as an
+edge, all operations form a computational graph. With the computational graph,
+speed and memory optimization can be conducted by scheduling the execution of
+the operations and memory allocation/release intelligently. In SINGA, users only
+need to define the neural network model using the
+[Model](https://github.com/apache/singa/blob/master/python/singa/model.py) API.
+The graph is constructed and optimized at the C++ backend automatically.
+
+In this way, users implement a network using the [Model](./graph) API,
+following the imperative programming style, like PyTorch. Different from
+PyTorch, which recreates the operations in every iteration, SINGA buffers the
+operations to create a computational graph implicitly (when this feature is
+enabled) after the first iteration. Therefore, SINGA ends up with a
+computational graph similar to the one created by libraries using declarative
+programming, e.g., TensorFlow, and consequently enjoys the optimizations done
+over the graph.
+
+## Example
+
+The following code illustrates the usage of the `Model` API.
+
+1. Implement the new model as a subclass of the Model class.
+
+```python
+class CNN(model.Model):
+
+    def __init__(self, num_classes=10, num_channels=1):
+        super(CNN, self).__init__()
+        self.conv1 = layer.Conv2d(num_channels, 20, 5, padding=0, activation="RELU")
+        self.conv2 = layer.Conv2d(20, 50, 5, padding=0, activation="RELU")
+        self.linear1 = layer.Linear(500)
+        self.linear2 = layer.Linear(num_classes)
+        self.pooling1 = layer.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = layer.MaxPool2d(2, 2, padding=0)
+        self.relu = layer.ReLU()
+        self.flatten = layer.Flatten()
+        self.softmax_cross_entropy = layer.SoftMaxCrossEntropy()
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = self.pooling2(y)
+        y = self.flatten(y)
+        y = self.linear1(y)
+        y = self.relu(y)
+        y = self.linear2(y)
+        return y
+
+    def train_one_batch(self, x, y):
+        out = self.forward(x)
+        loss = self.softmax_cross_entropy(out, y)
+        self.optimizer(loss)
+        return out, loss
+```
+
+2. Create an instance of model, optimizer, device, etc. Compile the model
+
+```python
+model = CNN()
+
+# initialize optimizer and attach it to the model
+sgd = opt.SGD(lr=0.005, momentum=0.9, weight_decay=1e-5)
+model.set_optimizer(sgd)
+
+# initialize device
+dev = device.create_cuda_gpu()
+
+# input and target placeholders for the model
+tx = tensor.Tensor((batch_size, 1, IMG_SIZE, IMG_SIZE), dev, tensor.float32)
+ty = tensor.Tensor((batch_size, num_classes), dev, tensor.int32)
+
+# compile the model before training
+model.compile([tx], is_train=True, use_graph=True, sequential=False)
+```
+
+3. Train the model iteratively
+
+```python
+for b in range(num_train_batch):
+    # generate the next mini-batch
+    x, y = ...
+
+    # Copy the data into input tensors
+    tx.copy_from_numpy(x)
+    ty.copy_from_numpy(y)
+
+    # Training with one batch
+    out, loss = model(tx, ty)
+```
+
+A Google Colab notebook of this example is available
+[here](https://colab.research.google.com/drive/1fbGUs1AsoX6bU5F745RwQpohP4bHTktq).
+
+More examples:
+
+- [MLP](https://github.com/apache/singa/blob/master/examples/mlp/model.py)
+- [CNN](https://github.com/apache/singa/blob/master/examples/cnn/model/cnn.py)
+- [ResNet](https://github.com/apache/singa/blob/master/examples/cnn/model/resnet.py)
+
+## Implementation
+
+### Graph Construction
+
+SINGA constructs the computational graph in three steps:
+
+1. buffer the operations
+2. analyze the dependencies among the operations
+3. create the nodes and edges based on the dependencies
+
+Take the matrix multiplication operation from the dense layer of an
+[MLP model](https://github.com/apache/singa/blob/master/examples/mlp/model.py)
+as an example. The operation is called in the `forward` function of the MLP
+class:
+
+```python
+class MLP(model.Model):
+
+    def __init__(self, data_size=10, perceptron_size=100, num_classes=10):
+        super(MLP, self).__init__()
+        self.linear1 = layer.Linear(perceptron_size)
+        ...
+
+    def forward(self, inputs):
+        y = self.linear1(inputs)
+        ...
+```
+
+The `Linear` layer is composed of the `matmul` operator. `autograd` implements
+the `matmul` operator by calling the function `Mult` exposed from CPP via SWIG.
+
+```python
+# implementation of matmul()
+singa.Mult(inputs, w)
+```
+
+At the backend, the `Mult` function is implemented by calling `GEMV`, a CBLAS
+function. Instead of calling `GEMV` directly, `Mult` submits `GEMV` and its
+arguments to the device as follows,
+
+```c++
+// implementation of Mult()
+C->device()->Exec(
+    [a, A, b, B, CRef](Context *ctx) mutable {
+        GEMV<DType, Lang>(a, A, B, b, &CRef, ctx);
+    },
+    read_blocks, {C->block()});
+```
+
+The `Exec` function of `Device` buffers the function and its arguments. In
+addition, it records the information about the blocks (a block is a chunk of
+memory for a tensor) to be read and written by this function.
+
+Once `Model.forward()` has been executed once, all operations are buffered by
+`Device`. Next, the read/write information of all operations is analyzed to
+create the computational graph. For example, if a block `b` is written by one
+operation O1 and later read by another operation O2, then O2 depends on O1 and
+there is a directed edge from O1 to O2, which represents block `b` (or its
+tensor). After that, a directed acyclic graph is constructed as shown below.
+The graph is constructed only once.
+
+![The computational graph of MLP](assets/GraphOfMLP.png)
+
+<br/>**Figure 1 - The computational graph of the MLP example.**
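+
+A toy sketch of this dependency analysis (a simplification of the idea; the
+real scheduling lives in the C++ backend):
+
+```python
+# each buffered operation records the blocks it reads and writes
+ops = [
+    {"name": "O1", "reads": [], "writes": ["b"]},
+    {"name": "O2", "reads": ["b"], "writes": ["c"]},
+]
+
+# add an edge from the last writer of a block to each later reader
+last_writer, edges = {}, []
+for op in ops:
+    for blk in op["reads"]:
+        if blk in last_writer:
+            edges.append((last_writer[blk], op["name"], blk))
+    for blk in op["writes"]:
+        last_writer[blk] = op["name"]
+
+print(edges)  # [('O1', 'O2', 'b')]
+```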
+
+### Optimization
+
+Currently, the following optimizations are done based on the computational
+graph.
+
+**Lazy allocation** When tensor/blocks are created, devices do not allocate
+memory for them immediately. Instead, when the block is accessed for the first
+time, the memory is allocated.
+
+**Automatic recycling** The reference count of each tensor/block is calculated
+based on the graph. Before executing the operations, the reference count is
+the number of operations that read this block. During the execution, once an
+operation is executed, the reference count of every input block is decreased
+by 1. If a block's reference count reaches 0, it means that this block will
+not be read again by the remaining operations, so its memory can be released
+safely. In addition, SINGA tracks the usage of each block outside of the
+graph. If a block is used by Python code (not by autograd operators), it will
+not be recycled.
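+
+The recycling rule can be sketched as follows (a simplification of the idea,
+not the backend code):
+
+```python
+# reference count = number of remaining operations that read each block
+ref_count = {"a": 1, "b": 2}
+
+def on_executed(read_blocks):
+    """Decrease counts after an operation runs; return blocks to release."""
+    freed = []
+    for blk in read_blocks:
+        ref_count[blk] -= 1
+        if ref_count[blk] == 0:  # no remaining reader: safe to release
+            freed.append(blk)
+    return freed
+
+print(on_executed(["a", "b"]))  # ['a']
+```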
+
+**Memory sharing** SINGA uses a memory pool, e.g.,
+[CnMem](https://github.com/NVIDIA/cnmem), to manage CUDA memory. With
+_Automatic recycling_ and the memory pool, SINGA can share memory among
+tensors. Consider two operations `c = a + b` and `d = 2 x c`. Before executing
+the second operation, according to _Lazy allocation_, the memory of `d` should
+be allocated. Suppose `a` is not used by the remaining operations. According
+to _Automatic recycling_, the block of `a` will be released after the first
+operation. Therefore, SINGA submits four operations to the CUDA stream:
+addition, free `a`, malloc `d`, and multiplication. The memory pool is then
+able to share the memory released by `a` with `d`, instead of asking the GPU
+to do a real malloc for `d`.
+
+Other optimization techniques, e.g., from compilers, such as common
+sub-expression elimination and parallelizing operations on different CUDA
+streams, can also be applied.
+
+## New Operator
+
+Each operator defined in the `autograd` module implements two functions,
+forward and backward, which are implemented by calling the operators from the
+backend. To add a new operator in `autograd`, you need to add the
+corresponding operators at the backend.
+
+Take the
+[Conv2d](https://github.com/apache/singa/blob/master/python/singa/autograd.py)
+operator as an example: at the Python side, the forward and backward functions
+are implemented by calling the operators from the backend, depending on the
+device type.
+
+```python
+class _Conv2d(Operation):
+
+    def forward(self, x, W, b=None):
+        ......
+        if training:
+            if self.handle.bias_term:
+                self.inputs = (x, W, b) # record x, W, b
+            else:
+                self.inputs = (x, W)
+
+        if (type(self.handle) != singa.ConvHandle):
+            return singa.GpuConvForward(x, W, b, self.handle)
+        else:
+            return singa.CpuConvForward(x, W, b, self.handle)
+
+    def backward(self, dy):
+        if (type(self.handle) != singa.ConvHandle):
+            dx = singa.GpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
+                                        self.handle)
+            dW = singa.GpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
+                                        self.handle)
+            db = singa.GpuConvBackwardb(
+                dy, self.inputs[2],
+                self.handle) if self.handle.bias_term else None
+        else:
+            dx = singa.CpuConvBackwardx(dy, self.inputs[1], self.inputs[0],
+                                        self.handle)
+            dW = singa.CpuConvBackwardW(dy, self.inputs[0], self.inputs[1],
+                                        self.handle)
+            db = singa.CpuConvBackwardb(
+                dy, self.inputs[2],
+                self.handle) if self.handle.bias_term else None
+        if db:
+            return dx, dW, db
+        else:
+            return dx, dW
+```
+
+Each operator at the backend should be implemented in the following way:
+
+- Suppose the operator is `foo()`; its real implementation should be wrapped in
+  another function e.g., `_foo()`. `foo()` passes `_foo` together with the
+  arguments as a lambda function to `Device`'s `Exec` function for buffering.
+  The blocks to be read and written are also passed to `Exec`.
+
+- All arguments used in the lambda expression need to be captured according to
+  the following rules.
+
+  - `capture by value`: if the argument variable is a local variable or will
+    be released immediately (e.g., intermediate tensors). Otherwise, these
+    variables will be destroyed once `foo()` exits.
+  - `capture by reference`: if the variable is recorded on the Python side or
+    is a persistent variable (e.g., the parameter W and ConvHandle in the
+    Conv2d class).
+  - `mutable`: the lambda expression should have the mutable tag if a variable
+    captured by value is modified in `_foo()`
+
+Here is one
+[example](https://github.com/apache/singa/blob/master/src/model/operation/convolution.cc)
+operator implemented at the backend.
+
+```c++
+Tensor GpuConvBackwardx(const Tensor &dy, const Tensor &W, const Tensor &x,
+                        const CudnnConvHandle &cch) {
+  CHECK_EQ(dy.device()->lang(), kCuda);
+
+  Tensor dx;
+  dx.ResetLike(x);
+
+  dy.device()->Exec(
+      /*
+       * dx is a local variable so it's captured by value
+       * dy is an intermediate tensor and isn't recorded on the python side
+       * W is an intermediate tensor but it's recorded on the python side
+       * cch is a variable and it's recorded on the python side
+       */
+      [dx, dy, &W, &cch](Context *ctx) mutable {
+        Block *wblock = W.block(), *dyblock = dy.block(), *dxblock = dx.block();
+        float alpha = 1.f, beta = 0.f;
+        cudnnConvolutionBackwardData(
+            ctx->cudnn_handle, &alpha, cch.filter_desc, wblock->data(),
+            cch.y_desc, dyblock->data(), cch.conv_desc, cch.bp_data_alg,
+            cch.workspace.block()->mutable_data(),
+            cch.workspace_count * sizeof(float), &beta, cch.x_desc,
+            dxblock->mutable_data());
+      },
+      {dy.block(), W.block()}, {dx.block(), cch.workspace.block()});
+      /* the lambda expression reads the blocks of tensors dy and W
+       * and writes the blocks of tensor dx and cch.workspace
+       */
+
+  return dx;
+}
+```
+
+## Benchmark
+
+### Single node
+
+- Experiment settings
+  - Model
+    - Using layer: ResNet50 in
+      [resnet.py](https://github.com/apache/singa/blob/master/examples/cnn/autograd/resnet_cifar10.py)
+    - Using model: ResNet50 in
+      [resnet.py](https://github.com/apache/singa/blob/master/examples/cnn/model/resnet.py)
+  - GPU: NVIDIA RTX 2080Ti
+- Notations
+  - `s`: second
+  - `it`: iteration
+  - `Mem`: peak memory usage of a single GPU
+  - `Throughput`: number of images processed per second
+  - `Time`: total time
+  - `Speed`: iterations per second
+  - `Reduction`: the memory usage reduction rate compared with that using layer
+  - `Speedup`: speedup ratio compared with the dev branch
+- Result
+  <table style="text-align: center">
+      <tr>
+          <th style="text-align: center">Batchsize</th>
+          <th style="text-align: center">Cases</th>
+          <th style="text-align: center">Mem(MB)</th>
+          <th style="text-align: center">Time(s)</th>
+          <th style="text-align: center">Speed(it/s)</th>
+          <th style="text-align: center">Throughput</th>
+          <th style="text-align: center">Reduction</th>
+          <th style="text-align: center">Speedup</th>
+      </tr>
+      <tr>
+          <td rowspan="4">16</td>
+          <td nowrap>layer</td>
+          <td>4975</td>
+          <td>14.1952</td>
+          <td>14.0893</td>
+          <td>225.4285</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>model:disable graph</td>
+          <td>4995</td>
+          <td>14.1264</td>
+          <td>14.1579</td>
+          <td>226.5261</td>
+          <td>-0.40%</td>
+          <td>1.0049</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, bfs</td>
+          <td>3283</td>
+          <td>13.7438</td>
+          <td>14.5520</td>
+          <td>232.8318</td>
+          <td>34.01%</td>
+          <td>1.0328</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, serial</td>
+          <td>3265</td>
+          <td>13.7420</td>
+          <td>14.5540</td>
+          <td>232.8635</td>
+          <td>34.37%</td>
+          <td>1.0330</td>
+      </tr>
+      <tr>
+          <td rowspan="4">32</td>
+          <td nowrap>layer</td>
+          <td>10119</td>
+          <td>13.4587</td>
+          <td>7.4302</td>
+          <td>237.7649</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>model:disable graph</td>
+          <td>10109</td>
+          <td>13.2952</td>
+          <td>7.5315</td>
+          <td>240.6875</td>
+          <td>0.10%</td>
+          <td>1.0123</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, bfs</td>
+          <td>6839</td>
+          <td>13.1059</td>
+          <td>7.6302</td>
+          <td>244.1648</td>
+          <td>32.41%</td>
+          <td>1.0269</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, serial</td>
+          <td>6845</td>
+          <td>13.0489</td>
+          <td>7.6635</td>
+          <td>245.2312</td>
+          <td>32.35%</td>
+          <td>1.0314</td>
+      </tr>
+  </table>
+
+### Multi processes
+
+- Experiment settings
+  - API
+    - using Layer: ResNet50 in
+      [resnet_dist.py](https://github.com/apache/singa/blob/master/examples/cnn/autograd/resnet_dist.py)
+    - using Model: ResNet50 in
+      [resnet.py](https://github.com/apache/singa/blob/master/examples/cnn/model/resnet.py)
+  - GPU: NVIDIA RTX 2080Ti \* 2
+  - MPI: two MPI processes on one node
+- Notations: the same as above
+- Result
+  <table style="text-align: center">
+      <tr>
+          <th style="text-align: center">Batchsize</th>
+          <th style="text-align: center">Cases</th>
+          <th style="text-align: center">Mem(MB)</th>
+          <th style="text-align: center">Time(s)</th>
+          <th style="text-align: center">Speed(it/s)</th>
+          <th style="text-align: center">Throughput</th>
+          <th style="text-align: center">Reduction</th>
+          <th style="text-align: center">Speedup</th>
+      </tr>
+      <tr>
+          <td rowspan="4">16</td>
+          <td nowrap>layer</td>
+          <td>5439</td>
+          <td>17.3323</td>
+          <td>11.5391</td>
+          <td>369.2522</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>model:disable graph</td>
+          <td>5427</td>
+          <td>17.8232</td>
+          <td>11.2213</td>
+          <td>359.0831</td>
+          <td>0.22%</td>
+          <td>0.9725</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, bfs</td>
+          <td>3389</td>
+          <td>18.2310</td>
+          <td>10.9703</td>
+          <td>351.0504</td>
+          <td>37.69%</td>
+          <td>0.9507</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, serial</td>
+          <td>3437</td>
+          <td>17.0389</td>
+          <td>11.7378</td>
+          <td>375.6103</td>
+          <td>36.81%</td>
+          <td>1.0172</td>
+      </tr>
+      <tr>
+          <td rowspan="4">32</td>
+          <td nowrap>layer</td>
+          <td>10547</td>
+          <td>14.8635</td>
+          <td>6.7279</td>
+          <td>430.5858</td>
+          <td>0.00%</td>
+          <td>1.0000</td>
+      </tr>
+      <tr>
+          <td nowrap>model:disable graph</td>
+          <td>10503</td>
+          <td>14.7746</td>
+          <td>6.7684</td>
+          <td>433.1748</td>
+          <td>0.42%</td>
+          <td>1.0060</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, bfs</td>
+          <td>6935</td>
+          <td>14.8553</td>
+          <td>6.7316</td>
+          <td>430.8231</td>
+          <td>34.25%</td>
+          <td>1.0006</td>
+      </tr>
+      <tr>
+          <td nowrap>model:enable graph, serial</td>
+          <td>7027</td>
+          <td>14.3271</td>
+          <td>6.9798</td>
+          <td>446.7074</td>
+          <td>33.37%</td>
+          <td>1.0374</td>
+      </tr>
+  </table>
+
+### Conclusion
+
+- Training with the computational graph enabled significantly reduces the
+  memory footprint.
+- Currently, the improvement in speed is small. More optimizations can be done
+  for better efficiency.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/how-to-release.md b/docs-site/website/versioned_docs/version-3.1.0/how-to-release.md
new file mode 100644
index 0000000..b991987
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/how-to-release.md
@@ -0,0 +1,209 @@
+---
+id: version-3.1.0-how-to-release
+title: How to Prepare a Release
+original_id: how-to-release
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+This is a guide for the
+[release preparing process](http://www.apache.org/dev/release-publishing.html)
+in SINGA.
+
+1. Select a release manager. The release manager (RM) is the coordinator of
+   the release process. It is the RM's signature (.asc) that is uploaded
+   together with the release. The RM generates a key (RSA 4096-bit) and
+   uploads it to a public key server. To be connected to the web of trust, the
+   RM needs to get the key endorsed (signed) by other Apache users, and should
+   first ask a mentor to help sign the key.
+   [How to generate the key](http://www.apache.org/dev/release-signing.html)?
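+
+   For reference, a hypothetical key-generation and publishing sequence (the
+   key server and KEYID are placeholders) might look like this:
+
+   ```sh
+   gpg --full-generate-key                        # choose RSA, 4096 bits
+   gpg --keyserver keys.openpgp.org --send-keys KEYID
+   gpg --armor --export KEYID >> KEYS             # append the public key to KEYS
+   ```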
+
+2. Check license. [FAQ](https://www.apache.org/legal/src-headers.html#faq-docs);
+   [SINGA Issue](https://issues.apache.org/jira/projects/SINGA/issues/SINGA-447)
+
+   - The codebase does not include third-party code that is incompatible with
+     APL;
+   - The dependencies are compatible with APL. GNU-like licenses are NOT
+     compatible;
+   - All source files written by us MUST include the Apache license header:
+     http://www.apache.org/legal/src-headers.html. There is a script there
+     which helps to propagate the header to all files.
+   - Update the LICENSE file. If we include any third-party code that is not
+     APL-licensed in the release package, we must state it at the end of the
+     NOTICE file.
+
+3. Bump the version. Check the code and documentation:
+
+   - The build process is error-free.
+   - Unit tests are included (as much as possible)
+   - Conda packages run without errors.
+   - The online documentation on the Apache website is up to date.
+
+4. Prepare the RELEASE_NOTES file. Include the following items: Introduction,
+   Features, Bugs (link to JIRA or Github PR), Changes, Dependency list, and
+   Incompatibility issues. Follow this
+   [example](http://commons.apache.org/proper/commons-digester/commons-digester-3.0/RELEASE-NOTES.txt).
+
+5. Package the release candidate. The release should be packaged into
+   apache-singa-VERSION.tar.gz. The release should not include any binary
+   files, including git files. However, the CMake compilation depends on the
+   git tag to get the version numbers; to remove this dependency, you need to
+   manually update the CMakeLists.txt file to set the version numbers.
+
+   ```
+   # remove the following lines
+   include(GetGitRevisionDescription)
+   git_describe(VERSION --tags --dirty=-d)
+   string(REGEX REPLACE "^([0-9]+)\\..*" "\\1" VERSION_MAJOR "${VERSION}")
+   string(REGEX REPLACE "^[0-9]+\\.([0-9]+).*" "\\1" VERSION_MINOR "${VERSION}")
+   string(REGEX REPLACE "^[0-9]+\\.[0-9]+\\.([0-9]+).*" "\\1" VERSION_PATCH "${VERSION}")
+
+   # set the numbers manually
+   SET(PACKAGE_VERSION 3.0.0)
+   SET(VERSION 3.0.0)
+   SET(SINGA_MAJOR_VERSION 3)  # 0 -
+   SET(SINGA_MINOR_VERSION 0)  # 0 - 9
+   SET(SINGA_PATCH_VERSION 0)  # 0 - 99
+   ```
+
+   Upload the package to the
+   [stage repo](https://dist.apache.org/repos/dist/dev/singa/). The tar file,
+   signature, KEY and SHA512 checksum file should be included. MD5 is no
+   longer used. The policy is
+   [here](http://www.apache.org/dev/release-distribution#sigs-and-sums). The
+   stage folder should include:
+
+   - apache-singa-VERSION.tar.gz
+   - apache-singa-VERSION.tar.gz.asc
+   - apache-singa-VERSION.tar.gz.sha512
+
+   The commands to create these files and upload them to the stage svn repo:
+
+   ```sh
+   # in singa repo
+   rm -rf .git
+   rm -rf rafiki/*
+   cd ..
+   tar -czvf apache-singa-VERSION.tar.gz  singa/
+
+   mkdir stage
+   cd stage
+   svn co https://dist.apache.org/repos/dist/dev/singa/
+   cd singa
+   # copy the KEYS file from singa repo to this folder if it is not here
+   cp ../../singa/KEYS .
+   mkdir VERSION
+   # copy the tar.gz file
+   mv ../../apache-singa-VERSION.tar.gz VERSION/
+   cd VERSION
+   sha512sum apache-singa-VERSION.tar.gz > apache-singa-VERSION.tar.gz.sha512
+   gpg --armor --output apache-singa-VERSION.tar.gz.asc --detach-sig apache-singa-VERSION.tar.gz
+   cd ..
+   svn add VERSION
+   svn commit
+   ```
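+
+   Voters can verify the uploaded artifacts before voting; a sketch of the
+   checks, run in the folder containing the downloaded files:
+
+   ```sh
+   # import the release manager's public key
+   gpg --import KEYS
+   # check the signature
+   gpg --verify apache-singa-VERSION.tar.gz.asc apache-singa-VERSION.tar.gz
+   # check the checksum
+   sha512sum -c apache-singa-VERSION.tar.gz.sha512
+   ```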
+
+6. Call for a vote by sending an email. An example is provided below.
+
+   ```
+   To: dev@singa.apache.org
+   Subject: [VOTE] Release apache-singa-X.Y.Z (release candidate N)
+
+   Hi all,
+
+   I have created a build for Apache SINGA 3.1.0, release candidate 2.
+
+   The release note is at
+   https://github.com/apache/singa/blob/master/RELEASE_NOTES.
+
+   The artifacts to be voted on are located here:
+   https://dist.apache.org/repos/dist/dev/singa/3.1.0.rc2/apache-singa-3.1.0.rc2.tar.gz
+
+   The hashes of the artifacts are as follows:
+   SHA512: 84545499ad36da108c6a599edd1d853f82d331bc03273b5278515554866f0c698e881f956b2eabcb6b29c07fa9fa4ff1add5a777b58db8a6a2362cf383b5c04d
+
+   Release artifacts are signed with the following key:
+   https://dist.apache.org/repos/dist/dev/singa/KEYS
+
+   The signature file is:
+   https://dist.apache.org/repos/dist/dev/singa/3.1.0.rc2/apache-singa-3.1.0.rc2.tar.gz.asc
+
+   The Github tag is at:
+   https://github.com/apache/singa/releases/tag/3.1.0.rc2
+
+   The documentation website is at
+   http://singa.apache.org/docs/next/installation/
+
+   Some examples are available for testing:
+   https://github.com/apache/singa/tree/master/examples
+
+   Please vote on releasing this package. The vote is open for at least 72
+   hours and passes if a majority of at least three +1 votes are cast.
+
+   [ ] +1 Release this package as Apache SINGA X.Y.Z
+   [ ] 0 I don't feel strongly about it, but I'm okay with the release
+   [ ] -1 Do not release this package because...
+
+   Here is my vote: +1
+   ```
+
+7. Wait at least 48 hours for test responses. Any PMC member, committer or
+   contributor can test the release features and give feedback. Everyone
+   should check these before voting +1. If the vote passes, send the result
+   email; otherwise, repeat from the beginning.
+
+   ```
+   To: dev@singa.apache.org
+   Subject: [RESULT][VOTE] Release apache-singa-X.Y.Z (release candidate N)
+
+   Thanks to everyone who has voted and given their comments. The tally is as
+   follows.
+
+   N binding +1s: <names>
+
+   N non-binding +1s: <names>
+
+   No 0s or -1s.
+
+   I am delighted to announce that the proposal to release Apache SINGA X.Y.Z
+   has passed.
+   ```
+
+8. Upload the package for
+   [distribution](http://www.apache.org/dev/release-publishing.html#distribution)
+   to https://dist.apache.org/repos/dist/release/singa/.
+
+9. Update the Download page of the SINGA website. The tar.gz file MUST be
+   downloaded from a mirror, using the closer.cgi script; other artifacts MUST
+   be downloaded from the main Apache site. More details are
+   [here](http://www.apache.org/dev/release-download-pages.html). Some
+   feedback we got during previous releases: "Download pages must only link to
+   formal releases, so must not include links to GitHub.", "Links to KEYS,
+   sigs and hashes must not use dist.apache.org; instead use
+   https://www.apache.org/dist/singa/...;", "Also you only need one KEYS link,
+   and there should be a description of how to use KEYS + sig or hash to
+   verify the downloads."
+
+10. Remove the RC tag and compile the conda packages.
+
+11. Publish the release information.
+
+    ```
+    To: announce@apache.org, dev@singa.apache.org
+    Subject: [ANNOUNCE] Apache SINGA X.Y.Z released
+
+    We are pleased to announce that SINGA X.Y.Z is released.
+
+    SINGA is a general distributed deep learning platform
+    for training big deep learning models over large datasets.
+    The release is available at: http://singa.apache.org/downloads.html
+    The main features of this release include XXX
+    We look forward to hearing your feedback, suggestions,
+    and contributions to the project.
+
+    On behalf of the SINGA team, {SINGA Team Member Name}
+    ```
diff --git a/docs-site/website/versioned_docs/version-3.1.0/install-win.md b/docs-site/website/versioned_docs/version-3.1.0/install-win.md
new file mode 100644
index 0000000..2b601d3
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/install-win.md
@@ -0,0 +1,400 @@
+---
+id: version-3.1.0-install-win
+title: Build SINGA on Windows
+original_id: install-win
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+The process of building SINGA from source on Microsoft Windows has four parts:
+installing the dependencies, building the SINGA source, (optionally) installing
+the Python module, and (optionally) running the unit tests.
+
+## Install Dependencies
+
+You may create a folder for building the dependencies.
+
+The dependencies are:
+
+- Compiler and IDE
+  - Visual Studio. The community edition is free and can be used to build SINGA.
+    https://www.visualstudio.com/
+- CMake
+  - Can be downloaded from http://cmake.org/
+  - Make sure the path to cmake executable is in the system path, or use full
+    path when calling cmake.
+- SWIG
+
+  - Can be downloaded from http://swig.org/
+  - Make sure the path to swig executable is in the system path, or use full
+    path when calling swig. Use a recent version such as 3.0.12.
+
+- Protocol Buffers
+  - Download a suitable version such as 2.6.1:
+    https://github.com/google/protobuf/releases/tag/v2.6.1 .
+  - Download both protobuf-2.6.1.zip and protoc-2.6.1-win32.zip .
+  - Extract both of them in dependencies folder. Add the path to protoc
+    executable to the system path, or use full path when calling it.
+  - Open the Visual Studio solution which can be found in vsproject folder.
+  - Change the build settings to Release and x64.
+  - Build the libprotobuf project.
+- Openblas
+
+  - Download a suitable source version such as 0.2.20 from
+    http://www.openblas.net
+  - Extract the source in the dependencies folder.
+  - If you don't have Perl installed, download a perl environment such as
+    Strawberry Perl (http://strawberryperl.com/)
+  - Build the Visual Studio solution by running this command in the source
+    folder:
+
+  ```bash
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+  - Open the Visual Studio solution and change the build settings to Release and
+    x64.
+  - Build the libopenblas project.
+
+- Google glog
+  - Download a suitable version such as 0.3.5 from
+    https://github.com/google/glog/releases
+  - Extract the source in the dependencies folder.
+  - Open the Visual Studio solution.
+  - Change the build settings to Release and x64.
+  - Build the libglog project.
+
+## Build SINGA source
+
+- Download SINGA source code
+- Compile the protobuf files:
+
+  - Go to src/proto folder
+
+  ```shell
+  mkdir python_out
+  protoc.exe *.proto --python_out python_out
+  ```
+
+- Generate swig interfaces for C++ and Python: Go to src/api
+
+  ```shell
+  swig -python -c++ singa.i
+  ```
+
+- Generate the Visual Studio solution for SINGA: Go to the SINGA source code
+  root folder
+
+  ```shell
+  mkdir build
+  cd build
+  ```
+
+- Call cmake and add the paths in your system similar to the following example:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64" ^
+    -DGLOG_INCLUDE_DIR="D:/WinSinga/dependencies/glog-0.3.5/src/windows" ^
+    -DGLOG_LIBRARIES="D:/WinSinga/dependencies/glog-0.3.5/x64/Release" ^
+    -DCBLAS_INCLUDE_DIR="D:/WinSinga/dependencies/openblas-0.2.20/lapack-netlib/CBLAS/include" ^
+    -DCBLAS_LIBRARIES="D:/WinSinga/dependencies/openblas-0.2.20/lib/RELEASE" ^
+    -DProtobuf_INCLUDE_DIR="D:/WinSinga/dependencies/protobuf-2.6.1/src" ^
+    -DProtobuf_LIBRARIES="D:/WinSinga/dependencies/protobuf-2.6.1/vsprojects/x64/Release" ^
+    -DProtobuf_PROTOC_EXECUTABLE="D:/WinSinga/dependencies/protoc-2.6.1-win32/protoc.exe" ^
+    ..
+  ```
+
+- Open the generated solution in Visual Studio
+- Change the build settings to Release and x64
+- Add the singa_wrap.cxx file from src/api to the singa_objects project
+- In the singa_objects project, open Additional Include Directories.
+- Add Python include path
+- Add numpy include path
+- Add protobuf include path
+- In the preprocessor definitions of the singa_objects project, add USE_GLOG
+- Build singa_objects project
+
+- In singa project:
+
+  - add singa_wrap.obj to Object Libraries
+  - change target name to \_singa_wrap
+  - change target extension to .pyd
+  - change configuration type to Dynamic Library (.dll)
+  - go to Additional Library Directories and add the path to python, openblas,
+    protobuf and glog libraries
+  - go to Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib
+
+- build singa project
+
+## Install Python module
+
+- Change `_singa_wrap.so` to `_singa_wrap.pyd` in build/python/setup.py
+- Copy the files in `src/proto/python_out` to `build/python/singa/proto`
+
+- Optionally create and activate a virtual environment:
+
+  ```shell
+  mkdir SingaEnv
+  virtualenv SingaEnv
+  SingaEnv\Scripts\activate
+  ```
+
+- Go to the build/python folder and run:
+
+  ```shell
+  python setup.py install
+  ```
+
+- Make \_singa_wrap.pyd, libglog.dll and libopenblas.dll available by adding
+  them to the path or by copying them to singa package folder in the python
+  site-packages
+
+- Verify that SINGA is installed by running:
+
+  ```shell
+  python -c "from singa import tensor"
+  ```
+
+A video tutorial for the build process can be found here:
+
+[![youtube video](https://img.youtube.com/vi/cteER7WeiGk/0.jpg)](https://www.youtube.com/watch?v=cteER7WeiGk)
+
+## Run Unit Tests
+
+- In the test folder, generate the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+- Open the generated solution in Visual Studio.
+
+- Change the build settings to Release and x64.
+
+- Build glog project.
+
+- In test_singa project:
+
+  - Add USE_GLOG to the Preprocessor Definitions.
+  - In Additional Include Directories, add path of GLOG_INCLUDE_DIR,
+    CBLAS_INCLUDE_DIR and Protobuf_INCLUDE_DIR which were used in step 2 above.
+    Add also build and build/include folders.
+  - Go to Additional Library Directories and add the path to openblas, protobuf
+    and glog libraries. Add also build/src/singa_objects.dir/Release.
+  - Go to Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib. Fix the names of the two libraries: gtest.lib and
+    singa_objects.lib.
+
+- Build test_singa project.
+
+- Make libglog.dll and libopenblas.dll available by adding them to the path or
+  by copying them to test/release folder
+
+- The unit tests can be executed
+
+  - From the command line:
+
+  ```shell
+  test_singa.exe
+  ```
+
+  - From Visual Studio:
+    - right click on the test_singa project and choose 'Set as StartUp Project'.
+    - from the Debug menu, choose 'Start Without Debugging'
+
+A video tutorial for running the unit tests can be found here:
+
+[![youtube video](https://img.youtube.com/vi/393gPtzMN1k/0.jpg)](https://www.youtube.com/watch?v=393gPtzMN1k)
+
+## Build GPU support with CUDA
+
+In this section, we will extend the previous steps to enable GPU.
+
+### Install Dependencies
+
+In addition to the dependencies in section 1 above, we will need the following:
+
+- CUDA
+
+  Download a suitable version such as 9.1 from
+  https://developer.nvidia.com/cuda-downloads . Make sure to install the Visual
+  Studio integration module.
+
+- cuDNN
+
+  Download a suitable version such as 7.1 from
+  https://developer.nvidia.com/cudnn
+
+- cnmem:
+
+  - Download the latest version from https://github.com/NVIDIA/cnmem
+  - Build the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+  - Open the generated solution in Visual Studio.
+  - Change the build settings to Release and x64.
+  - Build the cnmem project.
+
+### Build SINGA source
+
+- Call cmake and add the paths in your system similar to the following example:
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64" ^
+    -DGLOG_INCLUDE_DIR="D:/WinSinga/dependencies/glog-0.3.5/src/windows" ^
+    -DGLOG_LIBRARIES="D:/WinSinga/dependencies/glog-0.3.5/x64/Release" ^
+    -DCBLAS_INCLUDE_DIR="D:/WinSinga/dependencies/openblas-0.2.20/lapack-netlib/CBLAS/include" ^
+    -DCBLAS_LIBRARIES="D:/WinSinga/dependencies/openblas-0.2.20/lib/RELEASE" ^
+    -DProtobuf_INCLUDE_DIR="D:/WinSinga/dependencies/protobuf-2.6.1/src" ^
+    -DProtobuf_LIBRARIES="D:/WinSinga/dependencies/protobuf-2.6.1/vsprojects/x64/Release" ^
+    -DProtobuf_PROTOC_EXECUTABLE="D:/WinSinga/dependencies/protoc-2.6.1-win32/protoc.exe" ^
+    -DCUDNN_INCLUDE_DIR=D:\WinSinga\dependencies\cudnn-9.1-windows10-x64-v7.1\cuda\include ^
+    -DCUDNN_LIBRARIES=D:\WinSinga\dependencies\cudnn-9.1-windows10-x64-v7.1\cuda\lib\x64 ^
+    -DSWIG_DIR=D:\WinSinga\dependencies\swigwin-3.0.12 ^
+    -DSWIG_EXECUTABLE=D:\WinSinga\dependencies\swigwin-3.0.12\swig.exe ^
+    -DUSE_CUDA=YES ^
+    -DCUDNN_VERSION=7 ^
+    ..
+  ```
+
+* Generate swig interfaces for C++ and Python: Go to src/api
+
+  ```shell
+  swig -python -c++ singa.i
+  ```
+
+* Open the generated solution in Visual Studio
+
+* Change the build settings to Release and x64
+
+#### Building singa_objects
+
+- Add the singa_wrap.cxx file from src/api to the singa_objects project
+- In the singa_objects project, open Additional Include Directories.
+- Add Python include path
+- Add numpy include path
+- Add protobuf include path
+- Add include path for CUDA, cuDNN and cnmem
+- In the preprocessor definitions of the singa_objects project, add USE_GLOG,
+  USE_CUDA and USE_CUDNN. Remove DISABLE_WARNINGS.
+- Build singa_objects project
+
+#### Building singa-kernel
+
+- Create a new Visual Studio project of type "CUDA 9.1 Runtime". Give it a name
+  such as singa-kernel.
+- The project comes with an initial file called kernel.cu. Remove this file from
+  the project.
+- Add this file: src/core/tensor/math_kernel.cu
+- In the project settings:
+
+  - Set Platform Toolset to "Visual Studio 2015 (v140)"
+  - Set Configuration Type to "Static Library (.lib)"
+  - In the Include Directories, add build/include.
+
+- Build singa-kernel project
+
+#### Building singa
+
+- In singa project:
+
+  - add singa_wrap.obj to Object Libraries
+  - change target name to \_singa_wrap
+  - change target extension to .pyd
+  - change configuration type to Dynamic Library (.dll)
+  - go to Additional Library Directories and add the path to python, openblas,
+    protobuf and glog libraries
+  - Add also the library path to singa-kernel, cnmem, cuda and cudnn.
+  - go to Additional Dependencies and add libopenblas.lib, libglog.lib and
+    libprotobuf.lib.
+  - Add also: singa-kernel.lib, cnmem.lib, cudnn.lib, cuda.lib , cublas.lib,
+    curand.lib and cudart.lib.
+
+- build singa project
+
+### Install Python module
+
+- Change \_singa_wrap.so to \_singa_wrap.pyd in build/python/setup.py
+- Copy the files in src/proto/python_out to build/python/singa/proto
+
+- Optionally create and activate a virtual environment:
+
+  ```shell
+  mkdir SingaEnv
+  virtualenv SingaEnv
+  SingaEnv\Scripts\activate
+  ```
+
+- Go to the build/python folder and run:
+
+  ```shell
+  python setup.py install
+  ```
+
+- Make \_singa_wrap.pyd, libglog.dll, libopenblas.dll, cnmem.dll, CUDA Runtime
+  (e.g. cudart64_91.dll) and cuDNN (e.g. cudnn64_7.dll) available by adding them
+  to the path or by copying them to singa package folder in the python
+  site-packages
+
+- Verify that SINGA is installed by running:
+
+  ```shell
+  python -c "from singa import device; dev = device.create_cuda_gpu()"
+  ```
+
+A video tutorial for this part can be found here:
+
+[![youtube video](https://img.youtube.com/vi/YasKVjRtuDs/0.jpg)](https://www.youtube.com/watch?v=YasKVjRtuDs)
+
+### Run Unit Tests
+
+- In the test folder, generate the Visual Studio solution:
+
+  ```shell
+  cmake -G "Visual Studio 15 2017 Win64"
+  ```
+
+- Open the generated solution in Visual Studio, or add the project to the singa
+  solution that was created in step 5.2
+
+- Change the build settings to Release and x64.
+
+- Build glog project.
+
+- In test_singa project:
+
+  - Add USE_GLOG; USE_CUDA; USE_CUDNN to the Preprocessor Definitions.
+  - In Additional Include Directories, add path of GLOG_INCLUDE_DIR,
+    CBLAS_INCLUDE_DIR and Protobuf_INCLUDE_DIR which were used in step 5.2
+    above. Add also build, build/include, CUDA and cuDNN include folders.
+  - Go to Additional Library Directories and add the path to openblas, protobuf
+    and glog libraries. Add also build/src/singa_objects.dir/Release,
+    singa-kernel, cnmem, CUDA and cuDNN library paths.
+  - Go to Additional Dependencies and add libopenblas.lib; libglog.lib;
+    libprotobuf.lib; cnmem.lib; cudnn.lib; cuda.lib; cublas.lib; curand.lib;
+    cudart.lib; singa-kernel.lib. Fix the names of the two libraries: gtest.lib
+    and singa_objects.lib.
+
+* Build test_singa project.
+
+* Make libglog.dll, libopenblas.dll, cnmem.dll, cudart64_91.dll and
+  cudnn64_7.dll available by adding them to the path or by copying them to
+  test/release folder
+
+* The unit tests can be executed
+
+  - From the command line:
+
+    ```shell
+    test_singa.exe
+    ```
+
+  - From Visual Studio:
+    - right click on the test_singa project and choose 'Set as StartUp Project'.
+    - from the Debug menu, choose 'Start Without Debugging'
+
+A video tutorial for running the unit tests can be found here:
+
+[![youtube video](https://img.youtube.com/vi/YOjwtrvTPn4/0.jpg)](https://www.youtube.com/watch?v=YOjwtrvTPn4)
diff --git a/docs-site/website/versioned_docs/version-3.1.0/installation.md b/docs-site/website/versioned_docs/version-3.1.0/installation.md
new file mode 100644
index 0000000..732bd79
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/installation.md
@@ -0,0 +1,184 @@
+---
+id: version-3.1.0-installation
+title: Installation
+original_id: installation
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## Using Conda
+
+Conda is a package manager for Python, C++ and other packages.
+
+Currently, SINGA has conda packages for Linux and MacOSX.
+[Miniconda3](https://conda.io/miniconda.html) is recommended for use with
+SINGA. After installing miniconda, execute one of the following commands to
+install SINGA.
+
+1. CPU only
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Ntkhi-Z6XTR8WYPXiLwujHd2dOm0772V?usp=sharing)
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa-cpu
+```
+
+2. GPU with CUDA and cuDNN (CUDA driver >=384.81 is required)
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1do_TLJe18IthLOnBOsHCEe-FFPGk1sPJ?usp=sharing)
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa-gpu
+```
+
+3. Install a specific version of SINGA. The following command lists all the
+   available SINGA packages.
+
+```shell
+$ conda search -c nusdbsystem singa
+
+Loading channels: done
+# Name                       Version           Build  Channel
+singa                      3.1.0.rc2        cpu_py36  nusdbsystem
+singa                      3.1.0.rc2 cudnn7.6.5_cuda10.2_py36  nusdbsystem
+singa                      3.1.0.rc2 cudnn7.6.5_cuda10.2_py37  nusdbsystem
+```
+
+<!--- > Please note that using the nightly built images is not recommended except for SINGA development and testing. Using stable releases is recommended. -->
+
+The following command installs a specific version of SINGA:
+
+```shell
+$ conda install -c nusdbsystem -c conda-forge singa=X.Y.Z=cpu_py36
+```
+
+If there is no error message from
+
+```shell
+$ python -c "from singa import tensor"
+```
+
+then SINGA is installed successfully.
+
+## Using Pip
+
+1. CPU only
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/17RA056Brwk0vBQTFaZ-l9EbqwADO0NA9?usp=sharing)
+
+```bash
+pip install singa -f http://singa.apache.org/docs/next/wheel-cpu.html --trusted-host singa.apache.org
+```
+
+You can install a specific version of SINGA via `singa==<version>`, where the
+`<version>` field should be replaced, e.g., `3.1.0`. The available SINGA
+versions are listed at the link.
+
+To install the latest develop version, replace the link with
+http://singa.apache.org/docs/next/wheel-cpu-dev.html
+
+2. GPU With CUDA and cuDNN
+   [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1W30IPCqj5fG8ADAQsFqclaCLyIclVcJL?usp=sharing)
+
+```bash
+pip install singa -f http://singa.apache.org/docs/next/wheel-gpu.html --trusted-host singa.apache.org
+```
+
+You can also configure SINGA version and the CUDA version, like
+`singa==3.1.0+cuda10.2`. The available combinations of SINGA version and CUDA
+version are listed at the link.
+
+To install the latest develop version, replace the link with
+http://singa.apache.org/docs/next/wheel-gpu-dev.html
+
+Note: the Python version of your local Python environment will be used to find
+the corresponding wheel package. For example, if your local Python is 3.6, then
+the wheel package compiled against Python 3.6 will be selected by pip and
+installed. In fact, the wheel file's name includes the SINGA version, CUDA
+version and Python version. Therefore, `pip` knows which wheel file to
+download and install.
+
+Refer to the comments at the top of the `setup.py` file for how to build the
+wheel packages.
+
+## Using Docker
+
+Install Docker on your local host machine following the
+[instructions](https://docs.docker.com/install/). Add your user into the
+[docker group](https://docs.docker.com/install/linux/linux-postinstall/) to run
+docker commands without `sudo`.
+
+1. CPU-only.
+
+```shell
+$ docker run -it apache/singa:X.Y.Z-cpu-ubuntu16.04 /bin/bash
+```
+
+2. With GPU enabled. Install
+   [Nvidia-Docker](https://github.com/NVIDIA/nvidia-docker) after installing
+   Docker.
+
+```shell
+$ nvidia-docker run -it apache/singa:X.Y.Z-cuda9.0-cudnn7.4.2-ubuntu16.04 /bin/bash
+```
+
+3. For the complete list of SINGA Docker images (tags), visit the
+   [docker hub site](https://hub.docker.com/r/apache/singa/). For each docker
+   image, the tag is named as
+
+```shell
+version-(cpu|gpu)[-devel]
+```
+
+| Tag       | Description                      | Example value                                                                                                                                                             |
+| --------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| `version` | SINGA version                    | '2.0.0-rc0', '2.0.0', '1.2.0'                                                                                                                                             |
+| `cpu`     | the image cannot run on GPUs     | 'cpu'                                                                                                                                                                     |
+| `gpu`     | the image can run on Nvidia GPUs | 'gpu', or 'cudax.x-cudnnx.x' e.g., 'cuda10.0-cudnn7.3'                                                                                                                    |
+| `devel`   | indicator for development        | if absent, SINGA Python package is installed for runtime only; if present, the building environment is also created, you can recompile SINGA from source at '/root/singa' |
+| `OS`      | indicate OS version number       | 'ubuntu16.04', 'ubuntu18.04'                                                                                                                                              |
+
+## From source
+
+You can [build and install SINGA](build.md) from the source code using native
+building tools or conda-build, on local host OS or in a Docker container.
+
+## FAQ
+
+- Q: Error from `from singa import tensor`
+
+  A: Check the detailed error from
+
+  ```shell
+  python -c  "from singa import _singa_wrap"
+  # go to the folder of _singa_wrap.so
+  ldd path to _singa_wrap.so
+  python
+  >> import importlib
+  >> importlib.import_module('_singa_wrap')
+  ```
+
+  The folder of `_singa_wrap.so` is like
+  `~/miniconda3/lib/python3.7/site-packages/singa`. Normally, the error is
+  caused by the mismatch or missing of dependent libraries, e.g. cuDNN or
+  protobuf. The solution is to create a new virtual environment and install
+  SINGA in that environment, e.g.,
+
+  ```shell
+  conda create -n singa
+  conda activate singa
+  conda install -c nusdbsystem -c conda-forge singa-cpu
+  ```
+
+- Q: When using a virtual environment, numpy is reinstalled every time I
+  install SINGA. However, that numpy is not used when I run `import numpy`
+
+  A: It could be caused by the `PYTHONPATH` environment variable, which should
+  be set to empty when you are using a virtual environment, to avoid conflicts
+  with the path of the virtual environment.
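+
+  For example (a sketch), clear the variable before activating the
+  environment:
+
+  ```shell
+  export PYTHONPATH=
+  conda activate singa
+  ```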
+
+- Q: When I run SINGA in Mac OS X, I got the error "Fatal Python error:
+  PyThreadState_Get: no current thread Abort trap: 6"
+
+  A: This error typically happens when you have multiple versions of Python in
+  your system, e.g., the one that comes with the OS and the one installed by
+  Homebrew. The Python linked by SINGA must be the same as the Python
+  interpreter. You can check your interpreter by `which python` and check the
+  Python linked by SINGA via `otool -L <path to _singa_wrap.so>`. This problem
+  should be resolved if SINGA is installed via conda.
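+
+  A sketch of the two checks (the site-packages path is an example; use the
+  path on your machine):
+
+  ```shell
+  which python
+  otool -L ~/miniconda3/lib/python3.7/site-packages/singa/_singa_wrap.so
+  ```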
diff --git a/docs-site/website/versioned_docs/version-3.1.0/onnx.md b/docs-site/website/versioned_docs/version-3.1.0/onnx.md
new file mode 100644
index 0000000..5aae4b2
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/onnx.md
@@ -0,0 +1,770 @@
+---
+id: version-3.1.0-onnx
+title: ONNX
+original_id: onnx
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+[ONNX](https://onnx.ai/) is an open representation format for machine learning
+models, which enables AI developers to use models across different libraries and
+tools. SINGA supports loading ONNX format models for training and inference, and
+saving models defined using SINGA APIs (e.g., [Module](./module)) into ONNX
+format.
+
+SINGA has been tested with the following
+[version](https://github.com/onnx/onnx/blob/master/docs/Versioning.md) of ONNX.
+
+| ONNX version | File format version | Opset version ai.onnx | Opset version ai.onnx.ml | Opset version ai.onnx.training |
+| ------------ | ------------------- | --------------------- | ------------------------ | ------------------------------ |
+| 1.6.0        | 6                   | 11                    | 2                        | -                              |
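+
+A quick way to check the ONNX version installed in your environment (a sanity
+check with the standard `onnx` package, not a SINGA API):
+
+```python
+import onnx
+print(onnx.__version__)
+```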
+
+## General usage
+
+### Loading an ONNX Model into SINGA
+
+After loading an ONNX model from disk by `onnx.load`, you need to update the
+model's batch size, since most models use a placeholder to represent it. An
+example is given here as `update_batch_size`. You only need to update the
+batch size of the input and output; the shapes of the internal tensors will be
+inferred automatically.
+
+Then, you can prepare the SINGA model by using `sonnx.prepare`. This function
+iterates and translates all the nodes within the ONNX model's graph into SINGA
+operators, loads all stored weights and infers each intermediate tensor's shape.
+
+```python
+import onnx
+from singa import device
+from singa import sonnx
+
+# if the input has multiple tensors? can put this function inside prepare()?
+def update_batch_size(onnx_model, batch_size):
+    model_input = onnx_model.graph.input[0]
+    model_input.type.tensor_type.shape.dim[0].dim_value = batch_size
+    model_output = onnx_model.graph.output[0]
+    model_output.type.tensor_type.shape.dim[0].dim_value = batch_size
+    return onnx_model
+
+
+model_path = "PATH/To/ONNX/MODEL"
+onnx_model = onnx.load(model_path)
+
+# set batch size
+onnx_model = update_batch_size(onnx_model, 1)
+
+# convert onnx graph nodes into SINGA operators
+dev = device.create_cuda_gpu()
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+```
+
+### Running Inference with the SINGA Model
+
+Once the model is created, you can run inference by calling `sg_ir.run`. The
+input and output must be SINGA `Tensor` instances. Since the SINGA model
+returns the output as a list, if there is only one output, you just need to
+take the first element.
+
+```python
+# can wrap the following code in prepare()
+# and provide a flag training=True/False?
+
+class Infer:
+
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+
+    def forward(self, x):
+        return self.sg_ir.run([x])[0]
+
+
+data = get_dataset()
+x = tensor.Tensor(device=dev, data=data)
+
+model = Infer(sg_ir)
+y = model.forward(x)
+```
+
+### Saving SINGA model into ONNX Format
+
+Given the input tensors and the output tensors generated by the operators of
+the model, you can trace back all internal operations. Therefore, a SINGA
+model is defined by its input and output tensors. To export a SINGA model into
+ONNX format, you just need to provide the input and output tensor list.
+
+```python3
+# x is the input tensor, y is the output tensor
+sonnx.to_onnx([x], [y])
+```
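+
+The returned object is an ONNX model, so it can be saved to disk with the
+`onnx` package, e.g. (the path is just an example):
+
+```python
+onnx_model = sonnx.to_onnx([x], [y])
+onnx.save(onnx_model, '/tmp/singa_model.onnx')
+```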
+
+### Re-training an ONNX model
+
+To train (or refine) an ONNX model using SINGA, you need to set the internal
+tensors to be trainable:
+
+```python
+class Infer:
+
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+        ## can wrap these codes in sonnx?
+        for idx, tens in sg_ir.tensor_map.items():
+            # allow the tensors to be updated
+            tens.requires_grad = True
+            tens.stores_grad = True
+
+    def forward(self, x):
+        return self.sg_ir.run([x])[0]
+
+autograd.training = False
+model = Infer(sg_ir)
+
+autograd.training = True
+# then you can train the model as usual (see the sketch below)
+```
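+
+A minimal sketch of one training step, reusing the autograd/opt APIs from the
+full example below (`model`, `dev` and the numpy batches `x`, `y` are assumed
+to exist):
+
+```python
+from singa import autograd, opt, tensor
+
+autograd.training = True
+sgd = opt.SGD(lr=0.005)
+
+x_batch = tensor.Tensor(device=dev, data=x)
+target = tensor.Tensor(device=dev, data=y)
+output = model.forward(x_batch)
+loss = autograd.softmax_cross_entropy(output, target)
+for p, gp in autograd.backward(loss):
+    sgd.update(p, gp)
+sgd.step()
+```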
+
+### Transfer-learning an ONNX model
+
+You can also append layers to the end of an ONNX model for transfer learning.
+The `last_layers` argument means you keep the ONNX layers in [0, last_layers]
+and cut the rest. Then you can append more layers as in a normal SINGA model.
+
+```python
+class Trans:
+
+    def __init__(self, sg_ir, last_layers):
+        self.sg_ir = sg_ir
+        self.last_layers = last_layers
+        self.append_linear1 = autograd.Linear(500, 128, bias=False)
+        self.append_linear2 = autograd.Linear(128, 32, bias=False)
+        self.append_linear3 = autograd.Linear(32, 10, bias=False)
+
+    def forward(self, x):
+        y = self.sg_ir.run([x], last_layers=self.last_layers)[0]
+        y = self.append_linear1(y)
+        y = autograd.relu(y)
+        y = self.append_linear2(y)
+        y = autograd.relu(y)
+        y = self.append_linear3(y)
+        y = autograd.relu(y)
+        return y
+
+autograd.training = False
+model = Trans(sg_ir, -1)
+
+# then you can train the model as usual
+```
+
+## A Full Example
+
+This part introduces the usage of SINGA ONNX using the MNIST example. It shows
+how to export, load, run inference on, re-train, and transfer-learn the MNIST
+model. You can try this part
+[here](https://colab.research.google.com/drive/1-YOfQqqw3HNhS8WpB8xjDQYutRdUdmCq).
+
+### Load dataset
+
+Firstly, you need to import some necessary libraries and define some auxiliary
+functions for downloading and preprocessing the dataset:
+
+```python
+import os
+import urllib.request
+import gzip
+import numpy as np
+import codecs
+
+from singa import device
+from singa import tensor
+from singa import opt
+from singa import autograd
+from singa import sonnx
+import onnx
+
+
+def load_dataset():
+    train_x_url = 'http://yann.lecun.com/exdb/mnist/train-images-idx3-ubyte.gz'
+    train_y_url = 'http://yann.lecun.com/exdb/mnist/train-labels-idx1-ubyte.gz'
+    valid_x_url = 'http://yann.lecun.com/exdb/mnist/t10k-images-idx3-ubyte.gz'
+    valid_y_url = 'http://yann.lecun.com/exdb/mnist/t10k-labels-idx1-ubyte.gz'
+    train_x = read_image_file(check_exist_or_download(train_x_url)).astype(
+        np.float32)
+    train_y = read_label_file(check_exist_or_download(train_y_url)).astype(
+        np.float32)
+    valid_x = read_image_file(check_exist_or_download(valid_x_url)).astype(
+        np.float32)
+    valid_y = read_label_file(check_exist_or_download(valid_y_url)).astype(
+        np.float32)
+    return train_x, train_y, valid_x, valid_y
+
+
+def check_exist_or_download(url):
+
+    download_dir = '/tmp/'
+
+    name = url.rsplit('/', 1)[-1]
+    filename = os.path.join(download_dir, name)
+    if not os.path.isfile(filename):
+        print("Downloading %s" % url)
+        urllib.request.urlretrieve(url, filename)
+    return filename
+
+
+def read_label_file(path):
+    with gzip.open(path, 'rb') as f:
+        data = f.read()
+        assert get_int(data[:4]) == 2049
+        length = get_int(data[4:8])
+        parsed = np.frombuffer(data, dtype=np.uint8, offset=8).reshape(
+            (length))
+        return parsed
+
+
+def get_int(b):
+    return int(codecs.encode(b, 'hex'), 16)
+
+
+def read_image_file(path):
+    with gzip.open(path, 'rb') as f:
+        data = f.read()
+        assert get_int(data[:4]) == 2051
+        length = get_int(data[4:8])
+        num_rows = get_int(data[8:12])
+        num_cols = get_int(data[12:16])
+        parsed = np.frombuffer(data, dtype=np.uint8, offset=16).reshape(
+            (length, 1, num_rows, num_cols))
+        return parsed
+
+
+def to_categorical(y, num_classes):
+    y = np.array(y, dtype="int")
+    n = y.shape[0]
+    categorical = np.zeros((n, num_classes))
+    categorical[np.arange(n), y] = 1
+    categorical = categorical.astype(np.float32)
+    return categorical
+```
+
+### MNIST model
+
+Then you can define a class called **CNN** to construct the mnist model which
+consists of several convolution, pooling, fully connection and relu layers. You
+can also define a function to calculate the **accuracy** of our result. Finally,
+you can define a **train** and a **test** function to handle the training and
+prediction process.
+
+```python
+class CNN:
+    def __init__(self):
+        self.conv1 = autograd.Conv2d(1, 20, 5, padding=0)
+        self.conv2 = autograd.Conv2d(20, 50, 5, padding=0)
+        self.linear1 = autograd.Linear(4 * 4 * 50, 500, bias=False)
+        self.linear2 = autograd.Linear(500, 10, bias=False)
+        self.pooling1 = autograd.MaxPool2d(2, 2, padding=0)
+        self.pooling2 = autograd.MaxPool2d(2, 2, padding=0)
+
+    def forward(self, x):
+        y = self.conv1(x)
+        y = autograd.relu(y)
+        y = self.pooling1(y)
+        y = self.conv2(y)
+        y = autograd.relu(y)
+        y = self.pooling2(y)
+        y = autograd.flatten(y)
+        y = self.linear1(y)
+        y = autograd.relu(y)
+        y = self.linear2(y)
+        return y
+
+
+def accuracy(pred, target):
+    y = np.argmax(pred, axis=1)
+    t = np.argmax(target, axis=1)
+    a = y == t
+    return np.array(a, "int").sum() / float(len(t))
+
+
+def train(model,
+          x,
+          y,
+          epochs=1,
+          batch_size=64,
+          dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+            output_batch = model.forward(x_batch)
+            # onnx_model = sonnx.to_onnx([x_batch], [y])
+            # print('The model is:\n{}'.format(onnx_model))
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.001)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f loss, %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("training completed")
+    return x_batch, output_batch
+
+def test(model, x, y, batch_size=64, dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    result = 0
+    for b in range(batch_number):
+        l_idx = b * batch_size
+        r_idx = (b + 1) * batch_size
+
+        x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+        target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+        output_batch = model.forward(x_batch)
+        result += accuracy(tensor.to_numpy(output_batch),
+                           tensor.to_numpy(target_batch))
+
+    print("testing acc %6.2f" % (result / batch_number))
+```
+
+### Train mnist model and export it to onnx
+
+Now, you can train the mnist model and export its onnx model by calling the
+**sonnx.to_onnx** function.
+
+```python
+def make_onnx(x, y):
+    return sonnx.to_onnx([x], [y])
+
+# create device
+dev = device.create_cuda_gpu()
+#dev = device.get_default_device()
+# create model
+model = CNN()
+# load data
+train_x, train_y, valid_x, valid_y = load_dataset()
+# normalization
+train_x = train_x / 255
+valid_x = valid_x / 255
+train_y = to_categorical(train_y, 10)
+valid_y = to_categorical(valid_y, 10)
+# do training
+autograd.training = True
+x, y = train(model, train_x, train_y, dev=dev)
+onnx_model = make_onnx(x, y)
+# print('The model is:\n{}'.format(onnx_model))
+
+# Save the ONNX model
+model_path = os.path.join('/', 'tmp', 'mnist.onnx')
+onnx.save(onnx_model, model_path)
+print('The model is saved.')
+```
+
+### Inference
+
+After you export the onnx model, you will find a file called **mnist.onnx** in
+the '/tmp' directory; this model can therefore be imported by other libraries.
+If you want to import this onnx model into SINGA again and run inference on
+the validation dataset, you can define a class called **Infer**, whose forward
+function will be called by the test function to run inference on the
+validation dataset. Note that you should set the training label to **False**
+to fix the gradients of the autograd operators.
+
+When importing the onnx model, you first need to call **onnx.load** to load
+it. Then the onnx model is fed into **sonnx.prepare**, which parses and
+initiates it into a SINGA model (**sg_ir** in the code). sg_ir contains a
+SINGA graph, and you can then run a step of inference by feeding input to its
+run function.
+
+```python
+class Infer:
+    def __init__(self, sg_ir):
+        self.sg_ir = sg_ir
+        for idx, tens in sg_ir.tensor_map.items():
+            # allow the tensors to be updated
+            tens.requires_grad = True
+            tens.stores_grad = True
+            sg_ir.tensor_map[idx] = tens
+
+    def forward(self, x):
+        # run one step of inference by feeding the input
+        return self.sg_ir.run([x])[0]
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev) # parse and initiate to a singa model
+
+# inference
+autograd.training = False
+print('The inference result is:')
+test(Infer(sg_ir), valid_x, valid_y, dev=dev)
+```
+
+### Re-training
+
+Assume that after importing the model you want to re-train it; we can define a
+function called **re_train**. Before calling this re_train function, we should
+set the training label to **True** so that the autograd operators update their
+gradients. After we finish the training, we set it to **False** again and call
+the test function to run inference.
+
+```python
+def re_train(sg_ir,
+             x,
+             y,
+             epochs=1,
+             batch_size=64,
+             dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    new_model = Infer(sg_ir)
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+
+            output_batch = new_model.forward(x_batch)
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.01)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f loss, %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("re-training completed")
+    return new_model
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+
+# re-training
+autograd.training = True
+new_model = re_train(sg_ir, train_x, train_y, dev=dev)
+autograd.training = False
+test(new_model, valid_x, valid_y, dev=dev)
+```
+
+### Transfer learning
+
+Finally, if we want to do transfer learning, we can define a class called
+**Trans** to append some layers after the onnx model. For demonstration, the
+code only appends several linear (fully connected) and relu layers after the
+onnx model. You can define a transfer_learning function to handle the training
+process of the transfer-learning model. The training label is handled the same
+way as before.
+
+```python
+class Trans:
+    def __init__(self, sg_ir, last_layers):
+        self.sg_ir = sg_ir
+        self.last_layers = last_layers
+        self.append_linear1 = autograd.Linear(500, 128, bias=False)
+        self.append_linear2 = autograd.Linear(128, 32, bias=False)
+        self.append_linear3 = autograd.Linear(32, 10, bias=False)
+
+    def forward(self, x):
+        y = self.sg_ir.run([x], last_layers=self.last_layers)[0]
+        y = self.append_linear1(y)
+        y = autograd.relu(y)
+        y = self.append_linear2(y)
+        y = autograd.relu(y)
+        y = self.append_linear3(y)
+        y = autograd.relu(y)
+        return y
+
+def transfer_learning(sg_ir,
+             x,
+             y,
+             epochs=1,
+             batch_size=64,
+             dev=device.get_default_device()):
+    batch_number = x.shape[0] // batch_size
+
+    trans_model = Trans(sg_ir, -1)
+
+    for i in range(epochs):
+        for b in range(batch_number):
+            l_idx = b * batch_size
+            r_idx = (b + 1) * batch_size
+
+            x_batch = tensor.Tensor(device=dev, data=x[l_idx:r_idx])
+            target_batch = tensor.Tensor(device=dev, data=y[l_idx:r_idx])
+            output_batch = trans_model.forward(x_batch)
+
+            loss = autograd.softmax_cross_entropy(output_batch, target_batch)
+            accuracy_rate = accuracy(tensor.to_numpy(output_batch),
+                                     tensor.to_numpy(target_batch))
+
+            sgd = opt.SGD(lr=0.07)
+            for p, gp in autograd.backward(loss):
+                sgd.update(p, gp)
+            sgd.step()
+
+            if b % 1e2 == 0:
+                print("acc %6.2f loss, %6.2f" %
+                      (accuracy_rate, tensor.to_numpy(loss)[0]))
+    print("transfer-learning completed")
+    return trans_model
+
+# load the ONNX model
+onnx_model = onnx.load(model_path)
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+
+# transfer-learning
+autograd.training = True
+new_model = transfer_learning(sg_ir, train_x, train_y, dev=dev)
+autograd.training = False
+test(new_model, valid_x, valid_y, dev=dev)
+```
+
+## ONNX model zoo
+
+The [ONNX Model Zoo](https://github.com/onnx/models) is a collection of
+pre-trained, state-of-the-art models in the ONNX format contributed by community
+members. SINGA now supports several CV and NLP models from the zoo, and more
+models will be supported soon.
+
+### Image Classification
+
+This collection of models takes images as input, then classifies the major
+objects in the images into 1000 object categories such as keyboard, mouse,
+pencil, and many animals.
+
+| Model Class                                                                                         | Reference                                               | Description                                                                                                                                                                                                                               | Link                                                                                                                                                    |
+| --------------------------------------------------------------------------------------------------- | ------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[MobileNet](https://github.com/onnx/models/tree/master/vision/classification/mobilenet)</b>      | [Sandler et al.](https://arxiv.org/abs/1801.04381)      | Light-weight deep neural network best suited for mobile and embedded vision applications. <br>Top-5 error from paper - ~10%                                                                                                               | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1HsixqJMIpKyEPhkbB8jy7NwNEFEAUWAf) |
+| <b>[ResNet18](https://github.com/onnx/models/tree/master/vision/classification/resnet)</b>          | [He et al.](https://arxiv.org/abs/1512.03385)           | A CNN model (up to 152 layers). Uses shortcut connections to achieve higher accuracy when classifying images. <br> Top-5 error from paper - ~3.6%                                                                                         | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1u1RYefSsVbiP4I-5wiBKHjsT9L0FxLm9) |
+| <b>[VGG16](https://github.com/onnx/models/tree/master/vision/classification/vgg)</b>                | [Simonyan et al.](https://arxiv.org/abs/1409.1556)      | Deep CNN model(up to 19 layers). Similar to AlexNet but uses multiple smaller kernel-sized filters that provides more accuracy when classifying images. <br>Top-5 error from paper - ~8%                                                  | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/14kxgRKtbjPCKKsDJVNi3AvTev81Gp_Ds) |
+| <b>[ShuffleNet_V2](https://github.com/onnx/models/tree/master/vision/classification/shufflenet)</b> | [Zhang et al.](https://arxiv.org/pdf/1707.01083.pdf)   | Extremely computation-efficient CNN model designed specifically for mobile devices. The architecture design considers direct metrics such as speed, instead of indirect metrics like FLOPs. Top-1 error from paper - ~30.6%                | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/19HfRu3YHP_H2z3BcZujVFRp23_J5XsuA?usp=sharing) |
+
+### Object Detection
+
+Object detection models detect the presence of multiple objects in an image and
+segment out areas of the image where the objects are detected.
+
+| Model Class                                                                                                       | Reference                                             | Description                                                                                                                        | Link                                                                                                                                                    |
+| ----------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[Tiny YOLOv2](https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/tiny_yolov2)</b> | [Redmon et al.](https://arxiv.org/pdf/1612.08242.pdf) | A real-time CNN for object detection that detects 20 different classes. A smaller version of the more complex full YOLOv2 network. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/11V4I6cRjIJNUv5ZGsEGwqHuoQEie6b1T) |
+
+### Face Analysis
+
+Face detection models identify and/or recognize human faces and emotions in
+given images.
+
+| Model Class                                                                                               | Reference                                          | Description                                                                                                                         | Link                                                                                                                                                    |
+| --------------------------------------------------------------------------------------------------------- | -------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[ArcFace](https://github.com/onnx/models/tree/master/vision/body_analysis/arcface)</b>                 | [Deng et al.](https://arxiv.org/abs/1801.07698)    | A CNN based model for face recognition which learns discriminative features of faces and produces embeddings for input face images. | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1qanaqUKGIDtifdzEzJOHjEj4kYzA9uJC) |
+| <b>[Emotion FerPlus](https://github.com/onnx/models/tree/master/vision/body_analysis/emotion_ferplus)</b> | [Barsoum et al.](https://arxiv.org/abs/1608.01041) | Deep CNN for emotion recognition trained on images of faces.                                                                        | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1XHtBQGRhe58PDi4LGYJzYueWBeWbO23r) |
+
+### Machine Comprehension
+
+This subset of natural language processing models answers questions about a
+given context paragraph.
+
+| Model Class                                                                                           | Reference                                                                                                                           | Description                                                                                                       | Link                                                                                                                                                                |
+| ----------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| <b>[BERT-Squad](https://github.com/onnx/models/tree/master/text/machine_comprehension/bert-squad)</b> | [Devlin et al.](https://arxiv.org/pdf/1810.04805.pdf)                                                                               | This model answers questions based on the context of the given input paragraph.                                   | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1kud-lUPjS_u-TkDAzihBTw0Vqr0FjCE-)             |
+| <b>[RoBERTa](https://github.com/onnx/models/tree/master/text/machine_comprehension/roberta)</b>       | [Liu et al.](https://arxiv.org/pdf/1907.11692.pdf)                                                                                  | A large transformer-based model that predicts sentiment based on given input text.                                 | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1F-c4LJSx3Cb2jW6tP7f8nAZDigyLH6iN?usp=sharing) |
+| <b>[GPT-2](https://github.com/onnx/models/tree/master/text/machine_comprehension/gpt-2)</b>           | [Radford et al.](https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf) | A large transformer-based language model that, given a sequence of words in some text, predicts the next word.     | [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1ZlXLSIMppPch6HgzKRillJiUcWn3PiK7?usp=sharing) |
+
+## Supported operators
+
+The following operators are supported:
+
+- Acos
+- Acosh
+- Add
+- And
+- Asin
+- Asinh
+- Atan
+- Atanh
+- AveragePool
+- BatchNormalization
+- Cast
+- Ceil
+- Clip
+- Concat
+- ConstantOfShape
+- Conv
+- Cos
+- Cosh
+- Div
+- Dropout
+- Elu
+- Equal
+- Erf
+- Expand
+- Flatten
+- Gather
+- Gemm
+- GlobalAveragePool
+- Greater
+- HardSigmoid
+- Identity
+- LeakyRelu
+- Less
+- Log
+- MatMul
+- Max
+- MaxPool
+- Mean
+- Min
+- Mul
+- Neg
+- NonZero
+- Not
+- OneHot
+- Or
+- Pad
+- Pow
+- PRelu
+- Reciprocal
+- ReduceMean
+- ReduceSum
+- Relu
+- Reshape
+- ScatterElements
+- Selu
+- Shape
+- Sigmoid
+- Sign
+- Sin
+- Sinh
+- Slice
+- Softmax
+- Softplus
+- Softsign
+- Split
+- Sqrt
+- Squeeze
+- Sub
+- Sum
+- Tan
+- Tanh
+- Tile
+- Transpose
+- Unsqueeze
+- Upsample
+- Where
+- Xor
+
+### Special comments for ONNX backend
+
+- Conv, MaxPool and AveragePool
+
+  The input must be of 1d (`N*C*H`) or 2d (`N*C*H*W`) shape, and `dilation`
+  must be 1.
+
+- BatchNormalization
+
+  `epsilon` is 1e-05 and cannot be changed.
+
+- Cast
+
+  Only float32 and int32 are supported; other types are cast to these two
+  types.
+
+- Squeeze and Unsqueeze
+
+  If you encounter errors when you `Squeeze` or `Unsqueeze` between `Tensor`
+  and Scalar, please report them to us.
+
+- Empty tensor
+
+  An empty tensor is illegal in SINGA.
+
+## Implementation
+
+The code of SINGA ONNX is located at `python/singa/sonnx.py`. There are three
+main classes: `SingaFrontend`, `SingaBackend`, and `SingaRep`. `SingaFrontend`
+translates a SINGA model into an ONNX model; `SingaBackend` translates an ONNX
+model into a `SingaRep` object which stores all SINGA operators and tensors (a
+tensor in this doc means a SINGA `Tensor`); `SingaRep` can be run like a SINGA
+model.
+
+### SingaFrontend
+
+The entry function of `SingaFrontend` is `singa_to_onnx_model`, which is also
+aliased as `to_onnx`. `singa_to_onnx_model` creates the ONNX model, and it
+also creates an ONNX graph by using `singa_to_onnx_graph`.
+
+`singa_to_onnx_graph` accepts the output of the model, and recursively iterates
+over the SINGA model's graph from the output to collect all operators into a
+queue. The input and intermediate tensors, i.e., trainable weights, of the
+SINGA model are picked up at the same time. The input is stored in
+`onnx_model.graph.input`; the output is stored in `onnx_model.graph.output`;
+and the trainable weights are stored in `onnx_model.graph.initializer`.
+
+Then the SINGA operators in the queue are translated to ONNX operators one by
+one. `_rename_operators` defines the operator name mapping between SINGA and
+ONNX. `_special_operators` defines which function is used to translate each
+operator.
+
+In addition, some operators in SINGA have definitions different from ONNX,
+that is, ONNX regards some attributes of SINGA operators as inputs, so
+`_unhandled_operators` defines which function handles each such special
+operator.
+
+Since the bool type is regarded as int32 in SINGA, `_bool_operators` defines
+the operators whose outputs should be changed to the bool type.
+
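+For illustration, below is a minimal sketch of driving `SingaFrontend` through
+`to_onnx`; it assumes `x` and `y` are the input and output tensors of a SINGA
+model that has already been run once, so that the operator graph is available.
+
+```python
+from singa import sonnx
+import onnx
+
+# to_onnx is the alias of SingaFrontend's singa_to_onnx_model
+onnx_model = sonnx.to_onnx([x], [y])
+# persist the exported model using the onnx package
+onnx.save(onnx_model, 'model.onnx')
+```
+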
+### SingaBackend
+
+The entry function of `SingaBackend` is `prepare`, which checks the version of
+the ONNX model and then calls `_onnx_model_to_singa_net`.
+
+The purpose of `_onnx_model_to_singa_net` is to get the SINGA tensors and
+operators. The tensors are stored in a dictionary keyed by their ONNX names,
+and the operators are stored in a queue in the form of
+`namedtuple('SingaOps', ['name', 'op', 'handle', 'forward'])`. For each
+operator, `name` is its ONNX node name; `op` is the ONNX node; `forward` is the
+SINGA operator's forward function; `handle` is prepared for special operators
+such as Conv and Pooling, which have a `handle` object.
+
+The first step of `_onnx_model_to_singa_net` is to call `_init_graph_parameter`
+to get all tensors within the model. For trainable weights, it inits a SINGA
+`Tensor` from `onnx_model.graph.initializer`. Please note, the weights may also
+be stored within the graph's input or an ONNX node called `Constant`; SINGA
+can handle these as well.
+
+Though all weights are stored within the ONNX model, nothing is known about
+the input of the model except its shape and type. So SINGA supports two ways
+to init the input: 1) generate a random tensor from its shape and type; 2)
+allow the user to assign the input. The first way works fine for most models;
+however, for some models such as BERT, the matrix indices cannot be randomly
+generated, otherwise errors will occur.
+
+Then, `_onnx_model_to_singa_net` iterates over all nodes within the ONNX graph
+to translate them to SINGA operators. Also, `_rename_operators` defines the
+operator name mapping between SINGA and ONNX. `_special_operators` defines
+which function is used to translate each operator. `_run_node` runs the
+generated SINGA model with its input tensors and stores its output tensors for
+use by later operators.
+
+This class finally returns a `SingaRep` object which stores all SINGA tensors
+and operators within it.
+
+### SingaRep
+
+`SingaRep` stores all SINGA tensors and operators. `run` accepts the input of
+the model and runs the SINGA operators one by one following the operator
+queue. The user can use `last_layers` to run the model only up to the last few
+layers. Set `all_outputs` to `False` to get only the final output, or `True`
+to also get all the intermediate outputs.
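+
+Putting `SingaBackend` and `SingaRep` together, a minimal sketch of importing
+and running an ONNX model may look as follows; `x` is assumed to be an input
+tensor already created on the device `dev`.
+
+```python
+import onnx
+from singa import device, sonnx
+
+dev = device.create_cuda_gpu()
+onnx_model = onnx.load('model.onnx')
+
+# prepare checks the model version and returns a SingaRep instance
+sg_ir = sonnx.prepare(onnx_model, device=dev)
+# run executes the SINGA operators one by one following the operator queue;
+# last_layers and all_outputs can be passed here as described above
+outputs = sg_ir.run([x])
+```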
diff --git a/docs-site/website/versioned_docs/version-3.1.0/optimizer.md b/docs-site/website/versioned_docs/version-3.1.0/optimizer.md
new file mode 100644
index 0000000..464a493
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/optimizer.md
@@ -0,0 +1,128 @@
+---
+id: version-3.1.0-optimizer
+title: Optimizer
+original_id: optimizer
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA supports various popular optimizers, including stochastic gradient
+descent with momentum, Adam, RMSProp, and AdaGrad. Each optimizer supports a
+decay scheduler that schedules the learning rate to be applied in different
+epochs. The optimizers and the decay schedulers are included in
+`singa/opt.py`.
+
+## Create an Optimizer
+
+1. SGD with momentum
+
+```python
+# define hyperparameter learning rate
+lr = 0.001
+# define hyperparameter momentum
+momentum = 0.9
+# define hyperparameter weight decay
+weight_decay = 0.0001
+
+from singa import opt
+sgd = opt.SGD(lr=lr, momentum=momentum, weight_decay=weight_decay)
+```
+
+2. RMSProp
+
+```python
+# define hyperparameter learning rate
+lr = 0.001
+# define hyperparameter rho
+rho = 0.9
+# define hyperparameter epsilon
+epsilon = 1e-8
+# define hyperparameter weight decay
+weight_decay = 0.0001
+
+from singa import opt
+rmsprop = opt.RMSProp(lr=lr, rho=rho, epsilon=epsilon, weight_decay=weight_decay)
+```
+
+3. AdaGrad
+
+```python
+# define hyperparameter learning rate
+lr = 0.001
+# define hyperparameter epsilon
+epsilon = 1e-8
+# define hyperparameter weight decay
+weight_decay = 0.0001
+
+from singa import opt
+adagrad = opt.AdaGrad(lr=lr, epsilon=epsilon, weight_decay=weight_decay)
+```
+
+4. Adam
+
+```python
+# define hyperparameter learning rate
+lr = 0.001
+# define hyperparameter beta 1
+beta_1 = 0.9
+# define hyperparameter beta 2
+beta_2 = 0.999
+# define hyperparameter epsilon
+epsilon = 1e-8
+# define hyperparameter weight decay
+weight_decay = 0.0001
+
+from singa import opt
+adam = opt.Adam(lr=lr, beta_1=beta_1, beta_2=beta_2, epsilon=epsilon, weight_decay=weight_decay)
+```
+
+## Create a Decay Scheduler
+
+```python
+from singa import opt
+
+# define the initial learning rate
+lr_init = 0.001
+# define the rate of decay in the decay scheduler
+decay_rate = 0.95
+# define whether the learning rate schedule has a staircase shape
+staircase = True
+# define the decay step of the decay scheduler (in this example the lr decays every 2 steps)
+decay_steps = 2
+
+# create the decay scheduler; the schedule of lr becomes lr_init * (decay_rate ^ (step // decay_steps))
+lr = opt.ExponentialDecay(lr_init, decay_steps, decay_rate, staircase)
+# Use the lr to create an optimizer
+sgd = opt.SGD(lr=lr, momentum=0.9, weight_decay=0.0001)
+```
+
+## Use the optimizer in Model API
+
+When we create the model, we need to attach the optimizer to the model.
+
+```python
+# create a CNN using the Model API
+model = CNN()
+
+# initialize optimizer and attach it to the model
+sgd = opt.SGD(lr=0.005, momentum=0.9, weight_decay=1e-5)
+model.set_optimizer(sgd)
+```
+
+Then, when we call the model, it runs the `train_one_batch` method that utilizes
+the optimizer.
+
+Hence, an example of an iterative loop to optimize the model is:
+
+```python
+for b in range(num_train_batch):
+    # generate the next mini-batch
+    x, y = ...
+
+    # Copy the data into input tensors
+    tx.copy_from_numpy(x)
+    ty.copy_from_numpy(y)
+
+    # Training with one batch
+    out, loss = model(tx, ty)
+```
diff --git a/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.0.0.md b/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.0.0.md
new file mode 100644
index 0000000..07e23bb
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.0.0.md
@@ -0,0 +1,106 @@
+---
+id: version-3.1.0-RELEASE_NOTES_3.0.0
+title: Apache SINGA-3.0.0 Release Notes
+original_id: RELEASE_NOTES_3.0.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA is a distributed deep learning library.
+
+This release includes the following changes:
+
+- Code quality has been improved by introducing linting checks in CI and an
+  auto code formatter. For linting, the tools `cpplint` and `pylint` are used,
+  configured to comply with the
+  [google coding styles](http://google.github.io/styleguide/); details in
+  `tool/linting/`. Similarly, the formatting tools `clang-format` and `yapf`,
+  configured with the google coding styles, are recommended for developers to
+  clean code before submitting changes; details in `tool/code-format/`.
+  [LGTM](https://lgtm.com) is enabled on Github for code quality checks;
+  license checking is also enabled.
+
+- New Tensor APIs are added for naming consistency, and feature enhancement:
+
+  - size(), mem_size(), get_value(), to_proto(), l1(), l2(): added for the sake
+    of naming consistency
+  - AsType(): convert data type between `float` and `int`
+  - ceil(): perform element-wise ceiling of the input
+  - concat(): concatenate two tensors
+  - index selector: e.g. tensor1[:,:,1:,1:]
+  - softmax(in, axis): perform softmax along an axis of a multi-dimensional
+    tensor
+
+- 14 new operators are added into the autograd module: Gemm, GlobalAveragePool,
+  ConstantOfShape, Dropout, ReduceSum, ReduceMean, Slice, Ceil, Split, Gather,
+  Tile, NonZero, Cast, OneHot. Their unit tests are added as well.
+
+- 14 new operators are added to the sonnx module for both backend and frontend:
+  [Gemm](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Gemm),
+  [GlobalAveragePool](https://github.com/onnx/onnx/blob/master/docs/Operators.md#GlobalAveragePool),
+  [ConstantOfShape](https://github.com/onnx/onnx/blob/master/docs/Operators.md#ConstantOfShape),
+  [Dropout](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Dropout),
+  [ReduceSum](https://github.com/onnx/onnx/blob/master/docs/Operators.md#ReduceSum),
+  [ReduceMean](https://github.com/onnx/onnx/blob/master/docs/Operators.md#ReduceMean),
+  [Slice](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Slice),
+  [Ceil](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Ceil),
+  [Split](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Split),
+  [Gather](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Gather),
+  [Tile](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Tile),
+  [NonZero](https://github.com/onnx/onnx/blob/master/docs/Operators.md#NonZero),
+  [Cast](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Cast),
+  [OneHot](https://github.com/onnx/onnx/blob/master/docs/Operators.md#OneHot).
+  Their tests are added as well.
+
+- Some ONNX models are imported into SINGA, including
+  [Bert-squad](https://github.com/onnx/models/tree/master/text/machine_comprehension/bert-squad),
+  [Arcface](https://github.com/onnx/models/tree/master/vision/body_analysis/arcface),
+  [FER+ Emotion](https://github.com/onnx/models/tree/master/vision/body_analysis/emotion_ferplus),
+  [MobileNet](https://github.com/onnx/models/tree/master/vision/classification/mobilenet),
+  [ResNet18](https://github.com/onnx/models/tree/master/vision/classification/resnet),
+  [Tiny Yolov2](https://github.com/onnx/models/tree/master/vision/object_detection_segmentation/tiny_yolov2),
+  [Vgg16](https://github.com/onnx/models/tree/master/vision/classification/vgg),
+  and Mnist.
+
+- Some operators now support
+  [multidirectional broadcasting](https://github.com/onnx/onnx/blob/master/docs/Broadcasting.md#multidirectional-broadcasting),
+  including Add, Sub, Mul, Div, Pow, PRelu, and Gemm.
+
+- Distributed training with communication optimization.
+  [DistOpt](./python/singa/opt.py) has implemented multiple optimization
+  techniques, including gradient sparsification, chunk transmission, and
+  gradient compression.
+
+- Computational graph construction at the CPP level. The operations submitted to
+  the Device are buffered. After analyzing the dependency, the computational
+  graph is created, which is further analyzed for speed and memory optimization.
+  To enable this feature, use the [Module API](./python/singa/module.py).
+
+- New website based on Docusaurus. The documentation files are moved to a
+  separate repo [singa-doc](https://github.com/apache/singa-doc). The static
+  website files are stored at
+  [singa-site](https://github.com/apache/singa-site).
+
+- DNNL ([Deep Neural Network Library](https://github.com/intel/mkl-dnn)),
+  powered by Intel, is integrated into
+  `model/operations/[batchnorm|pooling|convolution]`; the changes are opaque
+  to end users. The current version is dnnl v1.1, which replaced the previous
+  integration of mkl-dnn v0.18. The library can boost the performance of deep
+  learning operations when executing on CPU. The dnnl dependency is installed
+  through conda.
+
+- Some Tensor APIs are marked as deprecated; they can be replaced by broadcast
+  operations, which provide better support for multi-dimensional operations.
+  These APIs are add_column(), add_row(), div_column(), div_row(),
+  mult_column(), and mult_row().
+
+- Conv and Pooling are enhanced to support fine-grained padding like (2,3,2,3),
+  and
+  [SAME_UPPER, SAME_LOWER](https://github.com/onnx/onnx/blob/master/docs/Operators.md#Conv)
+  pad mode and shape checking.
+
+- Reconstruct sonnx:
+  - Support two types of weight value (Initializer and Constant Node);
+  - For some operators (BatchNorm, Reshape, Clip, Slice, Gather, Tile, OneHot),
+    move some inputs to its attributes;
+  - Define and implement the type conversion map.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.1.0.md b/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.1.0.md
new file mode 100644
index 0000000..dc0fed5
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/releases/RELEASE_NOTES_3.1.0.md
@@ -0,0 +1,51 @@
+---
+id: version-3.1.0-RELEASE_NOTES_3.1.0
+title: Apache SINGA-3.1.0 Release Notes
+original_id: RELEASE_NOTES_3.1.0
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA is a distributed deep learning library.
+
+This release includes the following changes:
+
+- Tensor core:
+
+  - Support tensor transformation (reshape, transpose) for tensors up to 6
+    dimensions.
+  - Implement traverse_unary_transform in the Cuda backend, which is similar
+    to the CPP backend one.
+
+- Add new tensor operators into the autograd module, including CosSim,
+  DepthToSpace, Embedding, Erf, Expand, Floor, Pad, Round, Rounde, SpaceToDepth,
+  UpSample, Where. The corresponding ONNX operators are thus supported by SINGA.
+
+- Add Embedding and Gemm into the layer module.
+
+- Add SGD operators to opt module, including RMSProp, Adam, and AdaGrad.
+
+- Extend the sonnx module to support DenseNet121, ShuffleNetv1, ShuffleNetv2,
+  SqueezeNet, VGG19, GPT2, and RoBERTa.
+
+- Reconstruct sonnx to
+
+  - Support creating operators from both layer and autograd.
+  - Re-write SingaRep to provide a more powerful intermediate representation of
+    SINGA.
+  - Add a SONNXModel which implements from Model to provide uniform API and
+    features.
+
+- Add one example that trains a BiLSTM model over the InsuranceQA data.
+
+- Replace the Travis CI with Github workflow. Add quality and coverage
+  management.
+
+- Add compiling and packaging scripts to create wheel packages for
+  distribution.
+
+- Fix bugs
+  - Fix IMDB LSTM model example training script.
+  - Fix Tensor operation Mult on Broadcasting use cases.
+  - The Gaussian function on Tensor can now run on a Tensor with odd size.
+  - Update a testing helper function gradients() in autograd to look up param
+    gradients by param python object id for testing purposes.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/software-stack.md b/docs-site/website/versioned_docs/version-3.1.0/software-stack.md
new file mode 100644
index 0000000..7173c96
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/software-stack.md
@@ -0,0 +1,154 @@
+---
+id: version-3.1.0-software-stack
+title: Software Stack
+original_id: software-stack
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA's software stack includes two major levels: the low-level backend
+classes and the Python interface level. Figure 1 illustrates them together
+with the
+hardware. The backend components provide basic data structures for deep learning
+models, hardware abstractions for scheduling and executing operations, and
+communication components for distributed training. The Python interface wraps
+some CPP data structures and provides additional high-level classes for neural
+network training, which makes it convenient to implement complex neural network
+models.
+
+Next, we introduce the software stack in a bottom-up manner.
+
+![SINGA V3 software stack](assets/singav3.1-sw.png) <br/> **Figure 1 - SINGA V3
+software stack.**
+
+## Low-level Backend
+
+### Device
+
+Each `Device` instance, i.e., a device, is created against one hardware
+device, e.g., a GPU or a CPU. `Device` manages the memory of the data
+structures, and schedules the operations for execution, e.g., on CUDA streams
+or CPU threads. Depending on the hardware and its programming language, SINGA
+has implemented the following specific device classes:
+
+- **CudaGPU** represents an Nvidia GPU card. The execution units are the CUDA
+  streams.
+- **CppCPU** represents a normal CPU. The execution units are the CPU threads.
+- **OpenclGPU** represents a normal GPU card from either Nvidia or AMD. The
+  execution units are the CommandQueues. Given that OpenCL is compatible with
+  many hardware devices, e.g. FPGA and ARM, the OpenclGPU has the potential to
+  be extended for other devices.
+
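+As a short sketch (using the `device` APIs that appear elsewhere in these
+docs), devices can be created as follows:
+
+```python
+from singa import device
+
+host = device.get_default_device()    # a CppCPU host device
+gpu = device.create_cuda_gpu()        # a CudaGPU on the first Nvidia card
+gpu0 = device.create_cuda_gpu_on(0)   # a CudaGPU bound to card 0
+```
+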
+### Tensor
+
+The `Tensor` class represents a multi-dimensional array, which stores model
+variables, e.g., the input images and feature maps from the convolution layer.
+Each `Tensor` instance (i.e., a tensor) is allocated on a device, which manages
+the memory of the tensor and schedules the (computation) operations against
+tensors. Most machine learning algorithms can be expressed using the (dense or
+sparse) tensor abstraction and its operations. Therefore, SINGA is able to run
+a wide range of models, including deep learning models and other traditional
+machine learning models.
+
+### Operator
+
+There are two types of operators against tensors: linear algebra operators
+like matrix multiplication, and neural network specific operators like
+convolution and pooling. The linear algebra operators are provided as `Tensor`
+functions and are implemented separately for different hardware devices:
+
+- CppMath (tensor_math_cpp.h) implements the tensor operations using Cpp for
+  CppCPU
+- CudaMath (tensor_math_cuda.h) implements the tensor operations using CUDA for
+  CudaGPU
+- OpenclMath (tensor_math_opencl.h) implements the tensor operations using
+  OpenCL for OpenclGPU
+
+The neural network specific operators are also implemented separately, e.g.,
+
+- GpuConvForward (convolution.h) implements the forward function of convolution
+  via CuDNN on Nvidia GPU.
+- CpuConvForward (convolution.h) implements the forward function of convolution
+  using CPP on CPU.
+
+Typically, users create a `Device` instance and use it to create multiple
+`Tensor` instances. When users call the Tensor functions or neural network
+operations, the corresponding implementation for the resident device will be
+invoked. In other words, the implementation of operators is transparent to
+users.
+
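+As a sketch of this transparency (using only the tensor and device APIs shown
+in the other docs of this site), the same expression runs on whichever device
+the tensor resides on:
+
+```python
+from singa import device, tensor
+
+# the same arithmetic dispatches to CppMath on the host device
+cpu_t = tensor.Tensor((2, 3), device.get_default_device())
+cpu_t.gaussian(0.0, 1.0)
+cpu_out = cpu_t * 2.0 + 1.0
+
+# ... and to CudaMath when the tensor resides on a CudaGPU
+gpu_t = tensor.Tensor((2, 3), device.create_cuda_gpu())
+gpu_t.gaussian(0.0, 1.0)
+gpu_out = gpu_t * 2.0 + 1.0
+```
+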
+The Tensor and Device abstractions are extensible to support a wide range of
+hardware devices using different programming languages. A new hardware device
+would be supported by adding a new Device subclass and the corresponding
+implementations of the operators.
+
+Optimizations in terms of speed and memory are done by the `Scheduler` and
+`MemPool` of the `Device`. For example, the `Scheduler` creates a
+[computational graph](./graph) according to the dependency of the operators.
+Then it can optimize the execution order of the operators for parallelism and
+memory sharing.
+
+### Communicator
+
+`Communicator` supports [distributed training](./dist-train). It implements
+the communication protocols using sockets, MPI, and NCCL. Typically, users
+only need to call the high-level APIs like `put()` and `get()` for sending and
+receiving tensors. Communication optimization for the topology, message size,
+etc. is done internally.
+
+## Python Interface
+
+All the backend components are exposed as Python modules via SWIG. In addition,
+the following classes are added to support the implementation of complex neural
+networks.
+
+### Opt
+
+`Opt` and its subclasses implement the methods (such as SGD) for updating model
+parameter values using parameter gradients. A subclass [DistOpt](./dist-train)
+synchronizes the gradients across the workers for distributed training by
+calling methods from `Communicator`.
+
+### Operator
+
+`Operator` wraps multiple functions implemented using the Tensor or neural
+network operators from the backend. For example, the forward function and
+backward function of `ReLU` compose the `ReLU` operator.
+
+### Layer
+
+`Layer` and its subclasses wrap the operators that have parameters. For
+instance, the convolution and linear operators have weight and bias
+parameters. The parameters are maintained by the corresponding `Layer` class.
+
+### Autograd
+
+[Autograd](./autograd) implements
+[reverse-mode automatic differentiation](https://rufflewind.com/2016-12-30/reverse-mode-automatic-differentiation)
+by recording the execution of the forward functions of the operators and
+calling the backward functions automatically in the reverse order. All
+functions can be buffered by the `Scheduler` to create a
+[computational graph](./graph) for efficiency and memory optimization.
+
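+A minimal sketch of this workflow, in the style of the scripts under
+`examples/`, is shown below; it assumes the input `x`, the labels `target`,
+and the parameter tensors `w` and `b` have already been created on one device.
+
+```python
+from singa import autograd, opt
+
+autograd.training = True    # record the forward functions for backward
+sgd = opt.SGD(lr=0.05)
+
+h = autograd.matmul(x, w)   # each call is recorded by autograd
+h = autograd.add_bias(h, b)
+y = autograd.relu(h)
+loss = autograd.softmax_cross_entropy(y, target)
+
+# traverse the recorded operators in reverse order and update the parameters
+for p, g in autograd.backward(loss):
+    sgd.update(p, g)
+```
+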
+### Model
+
+[Model](./graph) provides an easy interface to implement new network models.
+You just need to inherit `Model` and define the forward propagation of the
+model by creating and calling the layers or operators. `Model` will do
+autograd and update the parameters via `Opt` automatically when training data
+is fed into it. With the `Model` API, SINGA enjoys the advantages of
+imperative programming and declarative programming. Users implement a network
+using the [Model](./graph) API following the imperative programming style,
+like PyTorch. Different from PyTorch, which recreates the operations in every
+iteration, SINGA buffers the operations to create a computational graph
+implicitly (when this feature is enabled) after the first iteration. The graph
+is similar to those created by libraries using declarative programming, e.g.,
+TensorFlow. Therefore, SINGA can apply the memory and speed optimization
+techniques over the computational graph.
+
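+A condensed sketch of a `Model` subclass, following the style of
+`examples/mlp/model.py` (the exact layer names and optimizer call are
+assumptions based on that example and may differ slightly between versions):
+
+```python
+from singa import layer, model
+
+class MLP(model.Model):
+    def __init__(self, perceptron_size=100, num_classes=10):
+        super(MLP, self).__init__()
+        self.linear1 = layer.Linear(perceptron_size)
+        self.relu = layer.ReLU()
+        self.linear2 = layer.Linear(num_classes)
+        self.softmax_cross_entropy = layer.SoftMaxCrossEntropy()
+
+    def forward(self, x):
+        # define the forward propagation by calling the layers
+        return self.linear2(self.relu(self.linear1(x)))
+
+    def train_one_batch(self, x, y):
+        out = self.forward(x)
+        loss = self.softmax_cross_entropy(out, y)
+        self.optimizer(loss)    # compute gradients and update parameters
+        return out, loss
+```
+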
+### ONNX
+
+To support ONNX, SINGA implements a [sonnx](./onnx) module, which includes:
+
+- SingaFrontend for saving a SINGA model into ONNX format.
+- SingaBackend for loading an ONNX format model into SINGA for training and
+  inference.
diff --git a/docs-site/website/versioned_docs/version-3.1.0/team-list.md b/docs-site/website/versioned_docs/version-3.1.0/team-list.md
new file mode 100644
index 0000000..65664de
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/team-list.md
@@ -0,0 +1,60 @@
+---
+id: version-3.1.0-team-list
+title: The SINGA Team
+original_id: team-list
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+A successful project requires many people to play many roles. Some members write
+code or documentation, while others are valuable as testers, submitting patches
+and suggestions.
+
+The SINGA community has developers mainly from National University of Singapore,
+Zhejiang University, NetEase, Osaka University, yzBigData, etc.
+
+## PMC
+
+| Name          | Email                   | Organization                                  |
+| ------------- | ----------------------- | --------------------------------------------- |
+| Gang Chen     | cg@apache.org           | Zhejiang University                           |
+| Anh Dinh      | dinhtta@apache.org      | Singapore University of Technology and Design |
+| Ted Dunning   | tdunning@apache.org     | Apache Software Foundation                    |
+| Jinyang Gao   | jinyang@apache.org      | DAMO Academy, Alibaba Group                   |
+| Alan Gates    | gates@apache.org        | Apache Software Foundation                    |
+| Zhaojing Luo  | zhaojing@apache.org     | National University of Singapore              |
+| Thejas Nair   | thejas@apache.org       | Apache Software Foundation                    |
+| Beng Chin Ooi | ooibc@apache.org        | National University of Singapore              |
+| Moaz Reyad    | moaz@apache.org         | Université Grenoble Alpes                     |
+| Kian-Lee Tan  | tankianlee@apache.org   | National University of Singapore              |
+| Sheng Wang    | wangsh@apache.org       | DAMO Academy, Alibaba Group                   |
+| Wei Wang      | wangwei@apache.org      | National University of Singapore              |
+| Zhongle Xie   | zhongle@apache.org      | National University of Singapore              |
+| Sai Ho Yeung  | chrishkchris@apache.org | National University of Singapore              |
+| Meihui Zhang  | meihuizhang@apache.org  | Beijing Institute of Technology               |
+| Kaiping Zheng | kaiping@apache.org      | National University of Singapore              |
+
+## Committers
+
+| Name         | Email                  | Organization                                  |
+| ------------ | ---------------------- | --------------------------------------------- |
+| Xiangrui Cai | caixr@apache.org       | Nankai University                             |
+| Chonho Lee   | chonho@apache.org      | Osaka University                              |
+| Shicong Lin  | shicong@apache.org     | National University of Singapore              |
+| Rulin Xing   | rulin@apache.org       | Huazhong University of Science and Technology |
+| Wanqi Xue    | xuewanqi@apache.org    | Nanyang Technological University              |
+| Joddiy Zhang | joddiyzhang@apache.org | National University of Singapore              |
+
+## Contributors
+
+| Name               | Email                        | Organization                     |
+| ------------------ | ---------------------------- | -------------------------------- |
+| Haibo Chen         | hzchenhaibo@corp.netease.com | NetEase                          |
+| Shicheng Chen      | chengsc@comp.nus.edu.sg      | National University of Singapore |
+| Xin Ji             | vincent.j.xin@gmail.com      | Visenze, Singapore               |
+| Anthony K. H. Tung | atung@comp.nus.edu.sg        | National University of Singapore |
+| Ji Wang            | wangji@mzhtechnologies.com   | Hangzhou MZH Technologies        |
+| Yuan Wang          | wangyuan@corp.netease.com    | NetEase                          |
+| Wenfeng Wu         | dcswuw@gmail.com             | Freelancer, China                |
+| Kaiyuan Yang       | yangky@comp.nus.edu.sg       | National University of Singapore |
+| Chang Yao          | yaochang2009@gmail.com       | Hangzhou MZH Technologies        |
diff --git a/docs-site/website/versioned_docs/version-3.1.0/tensor.md b/docs-site/website/versioned_docs/version-3.1.0/tensor.md
new file mode 100644
index 0000000..b86752f
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/tensor.md
@@ -0,0 +1,283 @@
+---
+id: version-3.1.0-tensor
+title: Tensor
+original_id: tensor
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+Each Tensor instance is a multi-dimensional array allocated on a specific Device
+instance. Tensor instances store variables and provide linear algebra operations
+over different types of hardware devices without user awareness. Note that
+users need to make sure the tensor operands are allocated on the same device,
+except for copy functions.
+
+## Tensor Usage
+
+### Create Tensor
+
+```python
+>>> import numpy as np
+>>> from singa import tensor
+>>> tensor.from_numpy( np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32) )
+[[1. 0. 0.]
+ [0. 1. 0.]]
+```
+
+### Convert to numpy
+
+```python
+>>> a = np.asarray([[1, 0, 0], [0, 1, 0]], dtype=np.float32)
+>>> tensor.from_numpy(a)
+[[1. 0. 0.]
+ [0. 1. 0.]]
+>>> tensor.to_numpy(tensor.from_numpy(a))
+array([[1., 0., 0.],
+       [0., 1., 0.]], dtype=float32)
+```
+
+### Tensor Methods
+
+```python
+>>> t = tensor.from_numpy(a)
+>>> t.transpose([1,0])
+[[1. 0.]
+ [0. 1.]
+ [0. 0.]]
+```
+
+`tensor` transformation for tensors of up to 6 dimensions:
+
+```python
+>>> a = tensor.random((2,3,4,5,6,7))
+>>> a.shape
+(2, 3, 4, 5, 6, 7)
+>>> a.reshape((2,3,4,5,7,6)).transpose((3,2,1,0,4,5)).shape
+(5, 4, 3, 2, 7, 6)
+```
+
+### Tensor Arithmetic Methods
+
+`tensor` arithmetic is evaluated in real time.
+
+```python
+>>> t + 1
+[[2. 1. 1.]
+ [1. 2. 1.]]
+>>> t / 5
+[[0.2 0.  0. ]
+ [0.  0.2 0. ]]
+```
+
+`tensor` broadcasting arithmetic:
+
+```python
+>>> a
+[[1. 2. 3.]
+ [4. 5. 6.]]
+>>> b
+[[1. 2. 3.]]
+>>> a + b
+[[2. 4. 6.]
+ [5. 7. 9.]]
+>>> a * b
+[[ 1.  4.  9.]
+ [ 4. 10. 18.]]
+>>> a / b
+[[1.  1.  1. ]
+ [4.  2.5 2. ]]
+>>> a/=b # inplace operation
+>>> a
+[[1.  1.  1. ]
+ [4.  2.5 2. ]]
+```
+
+`tensor` broadcasting on matrix multiplication (GEMM)
+
+```python
+>>> from singa import tensor
+>>> a = tensor.random((2,2,2,3))
+>>> b = tensor.random((2,3,4))
+>>> tensor.mult(a,b).shape
+(2, 2, 2, 4)
+```
+
+### Tensor Functions
+
+Functions in module `singa.tensor` return a new `tensor` object after applying
+the transformation defined in the function.
+
+```python
+>>> tensor.log(t+1)
+[[0.6931472 0.        0.       ]
+ [0.        0.6931472 0.       ]]
+```
+
+### Tensor on Different Devices
+
+`tensor` is created on the host (CPU) by default; it can also be created on
+different hardware devices by specifying the `device`. A `tensor` can be moved
+between `device`s via the `to_device()` function.
+
+```python
+>>> from singa import device
+>>> x = tensor.Tensor((2, 3), device.create_cuda_gpu())
+>>> x.gaussian(1,1)
+>>> x
+[[1.531889   1.0128608  0.12691343]
+ [2.1674204  3.083676   2.7421203 ]]
+>>> # move to host
+>>> x.to_device(device.get_default_device())
+```
+
+### Use Tensor to Train an MLP
+
+```python
+
+"""
+  code snippet from examples/mlp/module.py
+"""
+
+label = get_label()
+data = get_data()
+
+dev = device.create_cuda_gpu_on(0)
+sgd = opt.SGD(0.05)
+
+# define tensor for input data and label
+tx = tensor.Tensor((400, 2), dev, tensor.float32)
+ty = tensor.Tensor((400,), dev, tensor.int32)
+model = MLP(data_size=2, perceptron_size=3, num_classes=2)
+
+# attach model to graph
+model.set_optimizer(sgd)
+model.compile([tx], is_train=True, use_graph=True, sequential=False)
+model.train()
+
+for i in range(1001):
+    tx.copy_from_numpy(data)
+    ty.copy_from_numpy(label)
+    out, loss = model(tx, ty, 'fp32', spars=None)
+
+    if i % 100 == 0:
+        print("training loss = ", tensor.to_numpy(loss)[0])
+```
+
+Output:
+
+```bash
+$ python3 examples/mlp/module.py
+training loss =  0.6158037
+training loss =  0.52852553
+training loss =  0.4571422
+training loss =  0.37274635
+training loss =  0.30146334
+training loss =  0.24906921
+training loss =  0.21128304
+training loss =  0.18390492
+training loss =  0.16362564
+training loss =  0.148164
+training loss =  0.13589878
+```
+
+## Tensor Implementation
+
+The previous section shows the general usage of `Tensor`; the implementation
+under the hood will be covered below. First, the design of the Python and C++
+tensors will be introduced. The later part will talk about how the frontend
+(Python) and backend (C++) are connected and how to extend them.
+
+### Python Tensor
+
+Python class `Tensor`, defined in `python/singa/tensor.py`, provides high-level
+tensor manipulations for implementing deep learning operations (via
+[autograd](./autograd)), as well as data management by end users.
+
+It primarily works by simply wrapping around C++ tensor methods, both
+arithmetic (e.g., `sum`) and non-arithmetic methods (e.g., `reshape`). Some
+advanced arithmetic operations are later introduced and implemented using the
+pure Python tensor API, e.g., `tensordot`. The Python Tensor APIs can be used
+to implement complex neural network operations easily with the flexible
+methods available.
+
+### C++ Tensor
+
+C++ class `Tensor`, defined in `include/singa/core/tensor.h`, primarily manages
+the memory that holds the data, and provides low level APIs for tensor
+manipulation. Also, it provides various arithmetic methods (e.g. `matmul`) by
+wrapping different backends (CUDA, BLAS, cuBLAS, etc.).
+
+#### Execution Context and Memory Block
+
+Two important concepts or data structures for `Tensor` are the execution context
+`device`, and the memory block `Block`.
+
+Each `Tensor` is physically stored on and managed by a hardware device,
+representing the execution context (CPU, GPU). Tensor math calculations are
+executed on the device.
+
+Tensor data is stored in a `Block` instance, defined in
+`include/singa/core/common.h`. `Block` owns the underlying data, while tensors
+take ownership of the metadata describing the tensor, like `shape` and
+`strides`.
+
+#### Tensor Math Backends
+
+To leverage the efficient math libraries provided by different backend
+hardware devices, SINGA has one set of implementations of the Tensor functions
+for each supported backend:
+
+- 'tensor_math_cpp.h' implements operations using Cpp (with CBLAS) for CppCPU
+  devices.
+- 'tensor_math_cuda.h' implements operations using Cuda (with cuBLAS) for
+  CudaGPU devices.
+- 'tensor_math_opencl.h' implements operations using OpenCL for OpenclGPU
+  devices.
+
+### Exposing C++ APIs to Python
+
+SWIG (http://www.swig.org/) is a tool that can automatically convert C++ APIs
+into Python APIs. SINGA uses SWIG to expose the C++ APIs to Python. Several
+files are generated by SWIG, including `python/singa/singa_wrap.py`. The
+Python modules (e.g., `tensor`, `device` and `autograd`) import this module to
+call the C++ APIs for implementing the Python classes and functions.
+
+```python
+import tensor
+
+t = tensor.Tensor(shape=(2, 3))
+```
+
+For example, when a Python `Tensor` instance is created as above, the `Tensor`
+class implementation creates an instance of the `Tensor` class defined in
+`singa_wrap.py`, which corresponds to the C++ `Tensor` class. For clarity, the
+`Tensor` class in `singa_wrap.py` is referred to as `CTensor` in `tensor.py`.
+
+```python
+# in tensor.py
+from . import singa_wrap as singa
+
+CTensor = singa.Tensor
+```
+
+### Create New Tensor Functions
+
+With the groundwork set by the previous description, extending tensor
+functions can be done easily in a bottom-up manner. For math operations, the
+steps are:
+
+- Declare the new API in `tensor.h`
+- Generate code using the predefined macro in `tensor.cc`, refer to
+  `GenUnaryTensorFn(Abs);` as an example.
+- Declare the template method/function in `tensor_math.h`
+- Do the real implementation at least for CPU (`tensor_math_cpp.h`) and GPU
+  (`tensor_math_cuda.h`)
+- Expose the API via SWIG by adding it into `src/api/core_tensor.i`
+- Define the Python Tensor API in `tensor.py` by calling the automatically
+  generated function in `singa_wrap.py`
+- Write unit tests where appropriate
+
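+For the Python-side steps, a hypothetical sketch for a new unary function
+`abs` may look like the following; `_call_singa_func` is assumed to be the
+helper in `tensor.py` that wraps a `singa_wrap` function into a new Python
+`Tensor`.
+
+```python
+# in tensor.py
+from . import singa_wrap as singa
+
+def abs(t):
+    """Return a new tensor whose elements are the absolute values of t."""
+    # singa.Abs is the SWIG-generated binding of the C++ Abs function
+    return _call_singa_func(singa.Abs, t.data)
+```
+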
+## Python API
+
+_work in progress_
+
+## CPP API
+
+_work in progress_
diff --git a/docs-site/website/versioned_docs/version-3.1.0/time-profiling.md b/docs-site/website/versioned_docs/version-3.1.0/time-profiling.md
new file mode 100644
index 0000000..2ef4ae1
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/time-profiling.md
@@ -0,0 +1,166 @@
+---
+id: version-3.1.0-time-profiling
+title: Time Profiling
+original_id: time-profiling
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+SINGA supports the time profiling of each of the operators buffered in the
+graph. To utilize the time profiling function, we first call the
+`device.SetVerbosity` method to set the verbosity of the time profiler, and
+then call the `device.PrintTimeProfiling` method to print out the results of
+the time profiling.
+
+## Setup the Time Profiling Verbosity
+
+To use the time profiling function, we need to set the verbosity. There are
+three levels of verbosity. With the default value `verbosity == 0`, no time
+profiling is done. When we set `verbosity == 1`, it profiles the forward and
+backward propagation time. When `verbosity == 2`, it profiles the time spent
+on every buffered operation in the graph.
+
+The following is the example code to setup the time profiling function:
+
+```python
+# create a device
+from singa import device
+dev = device.create_cuda_gpu()
+# set the verbosity
+verbosity = 2
+dev.SetVerbosity(verbosity)
+# optional: skip the first 5 iterations when profiling the time
+dev.SetSkipIteration(5)
+```
+
+Then, after we have completed the training at the end of the program, we can
+print the time profiling result by calling the `device.PrintTimeProfiling`
+method:
+
+```python
+dev.PrintTimeProfiling()
+```
+
+## Example Outputs for Different Verbosity
+
+We can run the ResNet
+[example](https://github.com/apache/singa/blob/master/examples/cnn/benchmark.py)
+to see the output with different settings of verbosity:
+
+1. `verbosity == 1`
+
+```
+Time Profiling:
+Forward Propagation Time : 0.0409127 sec
+Backward Propagation Time : 0.114813 sec
+```
+
+2. `verbosity == 2`
+
+```
+Time Profiling:
+OP_ID0. SetValue : 1.73722e-05 sec
+OP_ID1. cudnnConvForward : 0.000612724 sec
+OP_ID2. GpuBatchNormForwardTraining : 0.000559449 sec
+OP_ID3. ReLU : 0.000375004 sec
+OP_ID4. GpuPoolingForward : 0.000240041 sec
+OP_ID5. SetValue : 3.4176e-06 sec
+OP_ID6. cudnnConvForward : 0.000115619 sec
+OP_ID7. GpuBatchNormForwardTraining : 0.000150415 sec
+OP_ID8. ReLU : 9.95494e-05 sec
+OP_ID9. SetValue : 3.22432e-06 sec
+OP_ID10. cudnnConvForward : 0.000648668 sec
+OP_ID11. GpuBatchNormForwardTraining : 0.000149793 sec
+OP_ID12. ReLU : 9.92118e-05 sec
+OP_ID13. SetValue : 3.37728e-06 sec
+OP_ID14. cudnnConvForward : 0.000400953 sec
+OP_ID15. GpuBatchNormForwardTraining : 0.000572181 sec
+OP_ID16. SetValue : 3.21312e-06 sec
+OP_ID17. cudnnConvForward : 0.000398698 sec
+OP_ID18. GpuBatchNormForwardTraining : 0.00056836 sec
+OP_ID19. Add : 0.000542246 sec
+OP_ID20. ReLU : 0.000372783 sec
+OP_ID21. SetValue : 3.25312e-06 sec
+OP_ID22. cudnnConvForward : 0.000260731 sec
+OP_ID23. GpuBatchNormForwardTraining : 0.000149041 sec
+OP_ID24. ReLU : 9.9072e-05 sec
+OP_ID25. SetValue : 3.10592e-06 sec
+OP_ID26. cudnnConvForward : 0.000637481 sec
+OP_ID27. GpuBatchNormForwardTraining : 0.000152577 sec
+OP_ID28. ReLU : 9.90518e-05 sec
+OP_ID29. SetValue : 3.28224e-06 sec
+OP_ID30. cudnnConvForward : 0.000404586 sec
+OP_ID31. GpuBatchNormForwardTraining : 0.000569679 sec
+OP_ID32. Add : 0.000542291 sec
+OP_ID33. ReLU : 0.00037211 sec
+OP_ID34. SetValue : 3.13696e-06 sec
+OP_ID35. cudnnConvForward : 0.000261219 sec
+OP_ID36. GpuBatchNormForwardTraining : 0.000148281 sec
+OP_ID37. ReLU : 9.89299e-05 sec
+OP_ID38. SetValue : 3.25216e-06 sec
+OP_ID39. cudnnConvForward : 0.000633644 sec
+OP_ID40. GpuBatchNormForwardTraining : 0.000150711 sec
+OP_ID41. ReLU : 9.84902e-05 sec
+OP_ID42. SetValue : 3.18176e-06 sec
+OP_ID43. cudnnConvForward : 0.000402752 sec
+OP_ID44. GpuBatchNormForwardTraining : 0.000571523 sec
+OP_ID45. Add : 0.000542435 sec
+OP_ID46. ReLU : 0.000372539 sec
+OP_ID47. SetValue : 3.24672e-06 sec
+OP_ID48. cudnnConvForward : 0.000493054 sec
+OP_ID49. GpuBatchNormForwardTraining : 0.000293142 sec
+OP_ID50. ReLU : 0.000190047 sec
+OP_ID51. SetValue : 3.14784e-06 sec
+OP_ID52. cudnnConvForward : 0.00148837 sec
+OP_ID53. GpuBatchNormForwardTraining : 8.34794e-05 sec
+OP_ID54. ReLU : 5.23254e-05 sec
+OP_ID55. SetValue : 3.40096e-06 sec
+OP_ID56. cudnnConvForward : 0.000292971 sec
+OP_ID57. GpuBatchNormForwardTraining : 0.00029174 sec
+OP_ID58. SetValue : 3.3248e-06 sec
+OP_ID59. cudnnConvForward : 0.000590154 sec
+OP_ID60. GpuBatchNormForwardTraining : 0.000294149 sec
+OP_ID61. Add : 0.000275119 sec
+OP_ID62. ReLU : 0.000189268 sec
+OP_ID63. SetValue : 3.2704e-06 sec
+OP_ID64. cudnnConvForward : 0.000341232 sec
+OP_ID65. GpuBatchNormForwardTraining : 8.3304e-05 sec
+OP_ID66. ReLU : 5.23667e-05 sec
+OP_ID67. SetValue : 3.19936e-06 sec
+OP_ID68. cudnnConvForward : 0.000542484 sec
+OP_ID69. GpuBatchNormForwardTraining : 8.60537e-05 sec
+OP_ID70. ReLU : 5.2479e-05 sec
+OP_ID71. SetValue : 3.41824e-06 sec
+OP_ID72. cudnnConvForward : 0.000291295 sec
+OP_ID73. GpuBatchNormForwardTraining : 0.000292795 sec
+OP_ID74. Add : 0.000274438 sec
+OP_ID75. ReLU : 0.000189689 sec
+OP_ID76. SetValue : 3.21984e-06 sec
+OP_ID77. cudnnConvForward : 0.000338776 sec
+OP_ID78. GpuBatchNormForwardTraining : 8.484e-05 sec
+OP_ID79. ReLU : 5.29408e-05 sec
+OP_ID80. SetValue : 3.18208e-06 sec
+OP_ID81. cudnnConvForward : 0.000545542 sec
+OP_ID82. GpuBatchNormForwardTraining : 8.40976e-05 sec
+OP_ID83. ReLU : 5.2256e-05 sec
+OP_ID84. SetValue : 3.36256e-06 sec
+OP_ID85. cudnnConvForward : 0.000293003 sec
+OP_ID86. GpuBatchNormForwardTraining : 0.0002989 sec
+OP_ID87. Add : 0.000275041 sec
+OP_ID88. ReLU : 0.000189867 sec
+OP_ID89. SetValue : 3.1184e-06 sec
+OP_ID90. cudnnConvForward : 0.000340417 sec
+OP_ID91. GpuBatchNormForwardTraining : 8.39395e-05 sec
+OP_ID92. ReLU : 5.26544e-05 sec
+OP_ID93. SetValue : 3.2336e-06 sec
+OP_ID94. cudnnConvForward : 0.000539787 sec
+OP_ID95. GpuBatchNormForwardTraining : 8.2753e-05 sec
+OP_ID96. ReLU : 4.86758e-05 sec
+OP_ID97. SetValue : 3.24384e-06 sec
+OP_ID98. cudnnConvForward : 0.000287108 sec
+OP_ID99. GpuBatchNormForwardTraining : 0.000293127 sec
+OP_ID100. Add : 0.000269478 sec
+.
+.
+.
+```
diff --git a/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu-dev.md b/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu-dev.md
new file mode 100644
index 0000000..8de7fb1
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu-dev.md
@@ -0,0 +1,13 @@
+---
+id: version-3.1.0-wheel-cpu-dev
+title: CPU only Wheel Packages (develop version)
+original_id: wheel-cpu-dev
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## 3.0.0.dev200720
+
+- [Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720-cp36-cp36m-manylinux2014_x86_64.whl)
+- [Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720-cp37-cp37m-manylinux2014_x86_64.whl)
+- [Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720-cp38-cp38-manylinux2014_x86_64.whl)
diff --git a/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu.md b/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu.md
new file mode 100644
index 0000000..824d4a4
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/wheel-cpu.md
@@ -0,0 +1,19 @@
+---
+id: version-3.1.0-wheel-cpu
+title: CPU only Wheel Packages
+original_id: wheel-cpu
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## 3.1.0
+
+- [Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0-cp36-cp36m-manylinux2014_x86_64.whl)
+- [Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0-cp37-cp37m-manylinux2014_x86_64.whl)
+- [Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0-cp38-cp38-manylinux2014_x86_64.whl)
+
+## 3.0.0
+
+- [Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0-cp36-cp36m-manylinux2014_x86_64.whl)
+- [Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0-cp37-cp37m-manylinux2014_x86_64.whl)
+- [Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0-cp38-cp38-manylinux2014_x86_64.whl)
diff --git a/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu-dev.md b/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu-dev.md
new file mode 100644
index 0000000..120125b
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu-dev.md
@@ -0,0 +1,13 @@
+---
+id: version-3.1.0-wheel-gpu-dev
+title: Wheel Packages with CUDA enabled (develop version)
+original_id: wheel-gpu-dev
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## 3.0.0.dev200720
+
+- [CUDA10.2, cuDNN 7.6.5, Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720%2Bcuda10.2-cp36-cp36m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720%2Bcuda10.2-cp37-cp37m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0.dev200720%2Bcuda10.2-cp38-cp38-manylinux2014_x86_64.whl)
diff --git a/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu.md b/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu.md
new file mode 100644
index 0000000..7bd0221
--- /dev/null
+++ b/docs-site/website/versioned_docs/version-3.1.0/wheel-gpu.md
@@ -0,0 +1,22 @@
+---
+id: version-3.1.0-wheel-gpu
+title: Wheel Packages with CUDA Enabled
+original_id: wheel-gpu
+---
+
+<!--- Licensed to the Apache Software Foundation (ASF) under one or more contributor license agreements.  See the NOTICE file distributed with this work for additional information regarding copyright ownership.  The ASF licenses this file to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License.  You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the License for the specific language governing permissions and limitations under the License.  -->
+
+## 3.1.0
+
+- [CUDA10.2, cuDNN 7.6.5, Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0%2Bcuda10.2-cp36-cp36m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0%2Bcuda10.2-cp37-cp37m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.1.0%2Bcuda10.2-cp38-cp38-manylinux2014_x86_64.whl)
+
+## 3.0.0
+
+- [CUDA10.2, cuDNN 7.6.5, Python 3.6](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0%2Bcuda10.2-cp36-cp36m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.7](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0%2Bcuda10.2-cp37-cp37m-manylinux2014_x86_64.whl)
+- [CUDA10.2, cuDNN 7.6.5, Python 3.8](https://singa-wheel.s3-ap-southeast-1.amazonaws.com/singa-3.0.0%2Bcuda10.2-cp38-cp38-manylinux2014_x86_64.whl)
diff --git a/docs-site/website/versioned_sidebars/version-3.1.0-sidebars.json b/docs-site/website/versioned_sidebars/version-3.1.0-sidebars.json
new file mode 100644
index 0000000..dcd83c5
--- /dev/null
+++ b/docs-site/website/versioned_sidebars/version-3.1.0-sidebars.json
@@ -0,0 +1,37 @@
+{
+  "version-3.1.0-docs": {
+    "Getting Started": [
+      "version-3.1.0-installation",
+      "version-3.1.0-software-stack",
+      "version-3.1.0-examples"
+    ],
+    "Guides": [
+      "version-3.1.0-device",
+      "version-3.1.0-tensor",
+      "version-3.1.0-autograd",
+      "version-3.1.0-optimizer",
+      "version-3.1.0-graph",
+      "version-3.1.0-onnx",
+      "version-3.1.0-dist-train",
+      "version-3.1.0-time-profiling"
+    ],
+    "Development": [
+      "version-3.1.0-download-singa",
+      "version-3.1.0-build",
+      "version-3.1.0-contribute-code",
+      "version-3.1.0-contribute-docs",
+      "version-3.1.0-how-to-release",
+      "version-3.1.0-git-workflow"
+    ]
+  },
+  "version-3.1.0-community": {
+    "Community": [
+      "version-3.1.0-source-repository",
+      "version-3.1.0-mail-lists",
+      "version-3.1.0-issue-tracking",
+      "version-3.1.0-security",
+      "version-3.1.0-team-list",
+      "version-3.1.0-history-singa"
+    ]
+  }
+}
diff --git a/docs-site/website/versions.json b/docs-site/website/versions.json
index 739bce7..59b2d18 100644
--- a/docs-site/website/versions.json
+++ b/docs-site/website/versions.json
@@ -1,4 +1,5 @@
 [
+  "3.1.0",
   "3.0.0",
   "3.0.0.rc1",
   "2.0.0"