Merge pull request #12 from dcslin/tensor-impl

polish tensor implementation section
diff --git a/docs-site/docs/tensor.md b/docs-site/docs/tensor.md
index ca9c21e..5240e67 100644
--- a/docs-site/docs/tensor.md
+++ b/docs-site/docs/tensor.md
@@ -138,9 +138,28 @@
 ```
 
 ## Tensor implementation
+The previous section shows the general usage of `Tensor`; this section covers the implementation under the hood. First, the design of the Python and C++ tensors is introduced. Then we discuss how the frontend (Python) and backend (C++) are connected, and how to extend them.
 
-SINGA has three different sets of implmentations of Tensor functions, one for
-each type of Device.
+### Python Tensor:
+
+Python class `Tensor`, defined in `python/singa/tensor.py`, provides high-level tensor manipulation for implementing deep learning architectures (autograd and models), as well as data management by end users.
+
+It primarily works by wrapping the C++ tensor methods, both arithmetic (e.g. `sum`) and non-arithmetic (e.g. `reshape`). Some advanced arithmetic operations, e.g. `tensordot`, were later introduced and implemented in the pure Python tensor API. The Python Tensor API can be used to implement complex operations easily with the flexible methods available.
+
+The APIs are grouped into `Tensor` methods, and functions that take `Tensor` as inputs.
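
The wrapping pattern can be sketched with a toy example (hypothetical classes, not SINGA's actual code): a `Tensor` method simply delegates to the backend object, and a module-level function takes a `Tensor` as input:

```python
# Toy sketch of the method-vs-function API grouping (hypothetical names).
class BackendTensor:
    """Stands in for the C++ tensor exposed to Python."""
    def __init__(self, data):
        self.data = list(data)

    def sum(self):
        return sum(self.data)


class Tensor:
    """Python-level Tensor wrapping a backend tensor object."""
    def __init__(self, data):
        self.backend = BackendTensor(data)

    # method form: t.sum()
    def sum(self):
        return self.backend.sum()


# function form: takes a Tensor as input
def sum_tensor(t):
    return t.sum()


t = Tensor([1, 2, 3])
print(t.sum())        # 6
print(sum_tensor(t))  # 6
```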
+
+### C++ Tensor:
+C++ class `Tensor`, defined in `include/singa/core/tensor.h`, primarily manages the memory that holds the data and provides low-level APIs for tensor manipulation. It also provides various arithmetic methods (e.g. `matmul`) by wrapping different backends.
+
+#### Execution context and Memory Block
+Two important concepts, or data structures, for `Tensor` are the execution context `device` and the memory block `Block`.
+
+Each `Tensor` is linked to one device, which represents the execution context (CPU, GPU). Tensor math operations are asynchronously scheduled for execution on the linked device.
+
+Tensor data are stored in the class `Block`, defined in `include/singa/core/common.h`. `Block` owns the underlying data, while tensors own the metadata that describes the tensor, such as `shape` and `strides`.
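
A toy model of this ownership split (illustrative only, not the actual C++ classes): two tensors can share one `Block` while each owns its own metadata:

```python
# Toy model of the Tensor/Block ownership split (hypothetical, for illustration).
class Block:
    """Owns the underlying flat data buffer."""
    def __init__(self, data):
        self.data = list(data)


class Tensor:
    """Owns only metadata (shape, strides); may share a Block with other views."""
    def __init__(self, block, shape, strides):
        self.block = block
        self.shape = shape
        self.strides = strides


block = Block([1, 2, 3, 4, 5, 6])
a = Tensor(block, shape=(2, 3), strides=(3, 1))
# a "reshape" view: same underlying data, new metadata
b = Tensor(block, shape=(3, 2), strides=(2, 1))

assert a.block is b.block  # one copy of the data
assert a.shape != b.shape  # independent metadata
```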
+
+#### Tensor Math backends:
+To leverage the efficient math libraries provided by different backend hardware, SINGA has three different sets of implementations of Tensor functions, one for each type of Device.
 
 - 'tensor_math_cpp.h' implements operations using Cpp (with CBLAS) for CppCPU
   devices.
@@ -149,6 +168,23 @@
 - 'tensor_math_opencl.h' implements operations using OpenCL for OpenclGPU
   devices.
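
The per-device split can be pictured as a dispatch table: the same operation name resolves to a different implementation depending on the tensor's device. Below is a simplified Python sketch of that idea (SINGA actually does this with C++ templates specialized per device type):

```python
# Simplified sketch of per-device dispatch (hypothetical; SINGA uses C++ templates).
def abs_cpp(data):
    # stands in for the implementation in tensor_math_cpp.h
    return [abs(x) for x in data]


def abs_cuda(data):
    # stands in for tensor_math_cuda.h (the real one runs on the GPU)
    return [abs(x) for x in data]


# one implementation table per device type
IMPLS = {
    "cpp": {"abs": abs_cpp},
    "cuda": {"abs": abs_cuda},
}


def dispatch(op, device, data):
    """Resolve an operation name to the implementation for the given device."""
    return IMPLS[device][op](data)


print(dispatch("abs", "cpp", [-1, 2, -3]))  # [1, 2, 3]
```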
 
+### Connecting C++ Tensor and Python - SWIG
+
+While the C++ Tensor can be used standalone as a tensor library, it is further extended to bridge Python code via [SWIG](http://www.swig.org/), which can automatically compile a C++ API into Python modules.
+
+When compiling from source, SWIG generates several files, including `python/singa/singa_wrap.py`. The Python `Tensor` class imports this module and can then call the C++ APIs with ease.
+
+### Create new Tensor functions:
+With the groundwork laid by the previous sections, extending tensor functions can be done in a bottom-up manner. For math operations, the steps are:
+- Add the new API to `tensor.h`
+- Add the code generation via the predefined macros in `tensor.cc`; refer to `GenUnaryTensorFn(Abs);` as an example.
+- Add the template method/function placeholder in `tensor_math.h`
+- Provide implementations at least for CPU (`tensor_math_cpp.h`) and GPU (`tensor_math_cuda.h`)
+- Expose the API to SWIG for translation in `src/api/core_tensor.i`
+- Wrap the generated Python API in `python/singa/singa_wrap.py` and make it consistent with the Python `Tensor` API
+- Write unit tests where appropriate
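
The Python-facing end of these steps can be sketched as follows, using `Abs` as in the example above (hypothetical names, for illustration; the real generated module is `singa_wrap.py`): the generated module exposes a flat function, which is then wrapped both as a module-level function and as a `Tensor` method:

```python
# Hypothetical sketch of wrapping a generated flat function (not real singa_wrap code).
def generated_abs(data):
    """Stands in for a function exported by the SWIG-generated module."""
    return [abs(x) for x in data]


class Tensor:
    def __init__(self, data):
        self.data = list(data)


# module-level function form, consistent with the rest of the Tensor API
def tensor_abs(t):
    return Tensor(generated_abs(t.data))


# also expose it as a Tensor method
Tensor.abs = lambda self: tensor_abs(self)

t = Tensor([-1, -2, 3])
print(tensor_abs(t).data)  # [1, 2, 3]
print(t.abs().data)        # [1, 2, 3]
```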
+
+
 ## Python API
 
 _work in progress_