| <!--- Licensed to the Apache Software Foundation (ASF) under one --> |
| <!--- or more contributor license agreements. See the NOTICE file --> |
| <!--- distributed with this work for additional information --> |
| <!--- regarding copyright ownership. The ASF licenses this file --> |
| <!--- to you under the Apache License, Version 2.0 (the --> |
| <!--- "License"); you may not use this file except in compliance --> |
| <!--- with the License. You may obtain a copy of the License at --> |
| |
| <!--- http://www.apache.org/licenses/LICENSE-2.0 --> |
| |
| <!--- Unless required by applicable law or agreed to in writing, --> |
| <!--- software distributed under the License is distributed on an --> |
| <!--- "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY --> |
| <!--- KIND, either express or implied. See the License for the --> |
| <!--- specific language governing permissions and limitations --> |
| <!--- under the License. --> |
| |
| MShadow Documentation |
| ===== |
| This is the documentation for mshadow: A Lightweight CPU/GPU Matrix/Tensor Template Library in C++/CUDA. |
| |
| ### Links to Topics |
| |
| * [Tutorial](../guide) |
| * [API Documentation](http://homes.cs.washington.edu/~tqchen/mshadow/doc) |
  - You can run ```./mkdoc.sh``` to build the documentation locally
| * [Tutorial about Expression Template](../guide/exp-template) |
| * [Writing Multi-GPU and Distributed ML](../guide/mshadow-ps) |
| * [Compile Configuration script](../make) |
| * [Expression API](#expression-api) |
  - Introduces the concept of expressions in mshadow
| |
| Expression API |
| ===== |
Expressions are the key concept in mshadow; a common mshadow operation has the form ```tensor = some code to construct an expression```.
| |
There are three major types of expression (the sketch after this list illustrates all three):
* Mapper expressions: contain only element-wise operations over Mapper expressions.
  - A Mapper expression can be used as a composition component of other operations.
  - Tensors and scalars are Mapper expressions.
  - Example: ```weight = -eta * (grad + lambda * weight)``` is a Mapper expression.
  - Mapper expressions are translated using expression template code implemented by mshadow.
  - ***Assign safety***: element-wise mappings are assign safe, which means we can write ```A = A * 2 + B```, letting the lvalue appear in the expression, and the result is still correct.
* Chainer expressions: may contain operations that are not element-wise, such as reduction and broadcast.
  - Example: ```dst = mirror(src)``` is a Chainer expression.
  - ***Assign safety***: most Chainer extensions are not assign safe, which means the user should avoid putting the target in the source expression.
* Complex expressions: complex operations that need special translation rules to map them to specific implementations.
  - A Complex expression cannot be used as a composition component of other operations.
  - Example: ```dot(lhs.T(), rhs)``` is a Complex expression, so we cannot write ```dst = 1.0 + dot(lhs.T(), rhs)```.
  - However, limited syntax is supported depending on the specification; for example, ```dst += 2.0f * dot(lhs.T(), rhs)``` is supported.
  - Complex expressions are translated into specific implementations such as BLAS calls.
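
The sketch below shows all three kinds in one place. It is a minimal sketch with hypothetical tensor names, assuming all shapes are compatible:
```c++
template<typename xpu>
void ExampleExpressionKinds(Tensor<xpu, 2> A, Tensor<xpu, 2> B,
                            Tensor<xpu, 2> lhs, Tensor<xpu, 2> rhs) {
  // Mapper: element-wise only and assign safe, so A may appear on both sides
  A = A * 2.0f + B;
  // Chainer: generally not assign safe, so avoid A = mirror(A)
  A = mirror(B);
  // Complex: cannot be composed into larger expressions,
  // e.g. A = 1.0f + dot(lhs.T(), rhs) is invalid ...
  A = dot(lhs.T(), rhs);
  // ... but limited scaling and accumulation are supported
  A += 2.0f * dot(lhs.T(), rhs);
}
```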
| |
| ### Element-wise Operations |
The basic binary operators are overloaded to compose Mapper expressions, so we can write
| ```c++ |
| weight = (-eta) * (grad + lambda * weight); |
| ``` |
| We can also use customized binary operators, and unary operators: |
| ```c++ |
| struct maximum { |
| MSHADOW_XINLINE static float Map(float a, float b) { |
| return a > b ? a : b; |
| } |
| }; |
| template<typename xpu> |
| void ExampleMaximum(Tensor<xpu, 2> out, |
| const Tensor<xpu, 2> &A, |
| const Tensor<xpu, 2> &B) { |
  out = 10.0f * F<maximum>(A + 1.0f, B);
| } |
| struct sigmoid { |
| MSHADOW_XINLINE static float Map(float a) { |
| return 1.0f/(1.0f+expf(-a)); |
| } |
| }; |
| template<typename xpu> |
| void ExampleSigmoid(Tensor<xpu, 2> out, const Tensor<xpu, 2> &in) { |
| // equivalent to out = sigmoid(in*2) + 1; |
| out = F<op::plus>(F<sigmoid>(in * 2.0f), ScalarExp(1.0f)); |
| } |
| ``` |
| ### Matrix Multiplications |
Matrix multiplications are supported by the following syntax, where parts in brackets [] are optional:
| ``` |
dst <sv> [scale*] dot(lhs [.T()], rhs [.T()]), where <sv> can be =, +=, -=
| ``` |
| Example: |
| ```c++ |
| template<typename xpu> |
| void Backprop(Tensor<xpu, 2> gradin, |
| const Tensor<xpu, 2> &gradout, |
| const Tensor<xpu, 2> &netweight) { |
  gradin = 2.0f * dot(gradout, netweight.T());
| } |
| ``` |
| |
| ### Introducing Expression Extensions |
| Naming conventions: |
* ```Tensor<xpu, dim>``` refers to a tensor on any device, with any number of dimensions.
* ```xpu``` and ```dim``` are implicit template parameters.
* ```Expr<xpu, dim>``` refers to any Mapper expression whose result type is ```Tensor<xpu, dim>```.
| |
| List of functions: |
* [reshape](#reshape): reshapes a tensor to another shape; the total number of elements must remain the same
* [broadcast<?>](#broadcast): replicates a 1-dimensional tensor along a certain dimension
* [repmat](#repmat): special case of broadcast<1>, repeats a vector over rows to form a matrix
* [sumall_except_dim<?>](#sumall_except_dim): sums over all dimensions except the one specified by the template parameter
* [sum_rows](#sum_rows): special case of sumall_except_dim<1>, sums the rows of a matrix
* [unpack_patch2col](#unpack_patch2col): unpacks local (overlapping) patches of an image into columns of a matrix; can be used to implement convolution
* [pack_col2patch](#pack_col2patch): reverse operation of unpack_patch2col; can be used to implement deconvolution
* [pool](#pool): performs pooling on an image
* [unpool](#unpool): gets the gradient of a pooling result
* [crop](#crop): crops the original image to a smaller size
* [mirror](#mirror): gets the mirrored result of the input expression
| |
| ====== |
| ##### reshape |
* ```reshape(Expr<xpu, dim> src, Shape<dimdst> oshape)```
* reshapes a tensor to another shape; the total number of elements must remain the same
* parameters:
  - src: input data
  - oshape: target shape
* result expression type: ```Tensor<xpu, dimdst>``` with ```shape = oshape```; this is a Mapper expression
| ```c++ |
| void ExampleReshape(void) { |
  Tensor<cpu, 2> dst = NewTensor<cpu>(Shape2(4, 5), 0.0f);
| Tensor<cpu, 1> src = NewTensor<cpu>(Shape1(20), 1.0f); |
| dst = reshape(src, dst.shape_); |
| ... |
| } |
| ``` |
| ====== |
| |
| ##### broadcast |
* ```broadcast<dimcast>(Tensor<xpu, 1> src, Shape<dimdst> oshape)```
* replicates a 1-dimensional tensor along a certain dimension, specified by the template parameter dimcast
* parameters:
  - src: input 1-dimensional tensor
  - oshape: shape of the output
* return expression type: ```Tensor<xpu, dimdst>```, with ```shape = oshape```; this is a Chainer expression
| ```c++ |
| void ExampleBroadcast(void) { |
| Tensor<cpu, 2> dst = NewTensor<cpu>(Shape2(2, 3)); |
| Tensor<cpu, 1> src = NewTensor<cpu>(Shape1(2), 1.0f); |
| src[0] = 2.0f; src[1] = 1.0f; |
| dst = broadcast<0>(src, dst.shape_); |
| // dst[0][0] = 2, dst[0][1] = 2; dst[1][0]=1, dst[1][1] = 1 |
| ... |
| } |
| ``` |
| ====== |
| ##### repmat |
* ```repmat(Tensor<xpu, 1> src, int nrows)```
* special case of broadcast: repeats a 1-dimensional tensor over rows
* input parameters:
  - src: input vector
  - nrows: number of rows in the target
* return expression type: ```Tensor<xpu, 2>```, with ```shape = (nrows, src.size(0))```; this is a Chainer expression
| ```c++ |
| void ExampleRepmat(void) { |
| Tensor<cpu,2> dst = NewTensor<cpu>(Shape2(3, 2)); |
| Tensor<cpu,1> src = NewTensor<cpu>(Shape1(2), 1.0f); |
| src[0] = 2.0f; src[1] = 1.0f; |
| dst = repmat(src, 3); |
| // dst[0][0] = 2, dst[0][1] = 1; dst[1][0]=2, dst[1][1] = 1 |
| ... |
| } |
| ``` |
| ====== |
| ##### sumall_except_dim |
* ```sumall_except_dim<dimkeep>(Expr<xpu, dim> src)```
* sums over all dimensions except dimkeep
* input parameters:
  - src: input Mapper expression
* return expression type: ```Tensor<xpu, 1>```, with ```shape = (src.size(dimkeep))```; this is a Complex expression
* syntax: ```dst <sv> [scale*] sumall_except_dim<dimkeep>(src)```, where ```<sv>``` can be =, +=, -=, *=, /=
| ```c++ |
| void ExampleSumAllExceptDim(void) { |
| Tensor<cpu,3> src = NewTensor<cpu>(Shape3(2, 3, 2), 1.0f); |
| Tensor<cpu,1> dst = NewTensor<cpu>(Shape1(3), 1.0f); |
  dst += sumall_except_dim<1>(src * 2.0f);
  // dst[0] = 1.0 + 4.0 * 2.0 = 9.0
| ... |
| } |
| ``` |
| ====== |
| ##### sum_rows |
* ```sum_rows(Expr<xpu, 2> src)```
* sums the rows of a matrix
* input parameters:
  - src: input Mapper expression
* return expression type: ```Tensor<xpu, 1>```, with ```shape = (src.size(1))```; this is a Complex expression
* syntax: ```dst <sv> [scale*] sum_rows(src)```, where ```<sv>``` can be =, +=, -=, *=, /=
| ```c++ |
| void ExampleSumRows(void) { |
| Tensor<cpu, 2> src = NewTensor<cpu>(Shape2(3, 2), 1.0f); |
| Tensor<cpu, 1> dst = NewTensor<cpu>(Shape1(2), 1.0f); |
| dst += sum_rows(src + 1.0f); |
  // dst[0] = 1.0 + 3.0 * (1.0 + 1.0) = 7.0
| ... |
| } |
| ``` |
| ====== |
| ##### unpack_patch2col |
* ```unpack_patch2col(Expr<xpu, 3> img, int psize_y, int psize_x, int pstride)```
* unpacks local (overlapping) patches of an image into columns of a matrix; can be used to implement convolution. After unpacking, we can use ```output = dot(weight, mat)``` to get the convolved result, with the relations:
  - weight: shape[0] = out_channel, shape[1] = in_channel * psize_y * psize_x
  - output: shape[0] = out_channel, shape[1] = out_height * out_width * num_of_images
  - out_height = (in_height - psize_y) / pstride + 1, which means imperfect patches are padded with 0
  - out_width = (in_width - psize_x) / pstride + 1
* input parameters:
  - img: source image, can be an expression; shape (in_channel, in_height, in_width)
  - psize_y: height of each patch
  - psize_x: width of each patch
  - pstride: stride between patches
* return expression type: ```Tensor<xpu, 2>```, with ```shape = (in_channel * psize_y * psize_x, out_height * out_width)```; this is a Chainer expression
| ```c++ |
void ExampleConvolution(Tensor<cpu, 3> dst, Tensor<cpu, 3> src,
                        Tensor<cpu, 2> weight, int ksize, int stride) {
  int o_height = (src.size(1) - ksize) / stride + 1;
  int o_width = (src.size(2) - ksize) / stride + 1;
| utils::Assert(weight.size(1) == src.size(0) * ksize * ksize); |
| TensorContainer<cpu, 2> tmp_col(Shape2(src.size(0) * ksize * ksize, |
| o_height * o_width)); |
| TensorContainer<cpu, 2> tmp_dst(Shape2(weight.size(0), |
| o_height * o_width)); |
| tmp_col = unpack_patch2col(src, ksize, ksize, stride); |
| tmp_dst = dot(weight, tmp_col); |
| dst = reshape(tmp_dst, dst.shape_); |
| } |
| ``` |
| |
| ====== |
| ##### pack_col2patch |
* ```pack_col2patch(Tensor<xpu, 2> mat, Shape<3> imshape, int psize_y, int psize_x, int pstride)```
* reverse operation of unpack_patch2col; can be used to implement deconvolution
* input parameters:
  - mat: source matrix, same shape as the output of unpack_patch2col
  - imshape: shape of the target image
  - psize_y: height of each patch
  - psize_x: width of each patch
  - pstride: stride between patches
* return expression type: ```Tensor<xpu, 3>```, with ```shape = imshape```; this is a Chainer expression
| ```c++ |
void ExampleDeconvolution(Tensor<cpu, 3> bottom, Tensor<cpu, 3> top,
                          Tensor<cpu, 2> weight, int ksize, int stride) {
  int o_height = (bottom.size(1) - ksize) / stride + 1;
  int o_width = (bottom.size(2) - ksize) / stride + 1;
| utils::Assert(weight.size(1) == bottom.size(0) * ksize * ksize); |
| TensorContainer<cpu, 2> tmp_col(Shape2(bottom.size(0) * ksize * ksize, |
| o_height * o_width)); |
| TensorContainer<cpu, 2> tmp_dst(Shape2(weight.size(0), o_height*o_width)); |
| tmp_dst = reshape(top, tmp_dst.shape_); |
| tmp_col = dot(weight.T(), tmp_dst); |
| bottom = pack_col2patch(tmp_col, bottom.shape_, ksize, ksize, stride); |
| } |
| ``` |
| |
| ====== |
| ##### pool |
* ```pool<Reducer>(Expr<xpu, dim> img, [Shape<2> pshape,] int ksize_y, int ksize_x, int kstride)```
* pooling on an image with the specified kernel size and stride; can be used to implement max pooling and other pooling layers
* input parameters:
  - Reducer: the reduction operation, can be max or sum
  - img: source image, can be an expression; shape (in_channel, in_height, in_width)
  - pshape: [optional] shape of the output
  - ksize_y: height of the pooling kernel
  - ksize_x: width of the pooling kernel
  - kstride: stride of the pooling
* return expression type: ```Expr<xpu, dim>```, with ```shape = (in_channel, (in_height - ksize_y) / kstride + 1, (in_width - ksize_x) / kstride + 1)```, or pshape if it is given
  - this is a Chainer expression
| ```c++ |
void ExampleMaxPooling(TensorContainer<cpu, 3> &data, int ksize, int kstride) {
  TensorContainer<cpu, 3> pooled(Shape3(data.size(0),
                                        (data.size(1) - ksize) / kstride + 1,
                                        (data.size(2) - ksize) / kstride + 1));
  pooled = pool<red::maximum>(data, ksize, ksize, kstride);
}
| ``` |
| |
| ====== |
| ##### unpool |
* ```unpool<Reducer>(Tensor<xpu, 4> data_src, Tensor<xpu, 4> data_pooled, Tensor<xpu, 4> grad_pooled, int ksize_y, int ksize_x, int kstride)```
* unpooling on an image with the specified kernel size and stride; can be used to implement the backprop of max pooling and other pooling layers
* input parameters:
  - Reducer: the reduction operation, can be max or sum
  - data_src: source image batch
  - data_pooled: pooled image batch
  - grad_pooled: gradient from the upper layer
  - ksize_y: height of the pooling kernel
  - ksize_x: width of the pooling kernel
  - kstride: stride of the pooling
* return: an expression with the same shape as data_src
| ```c++ |
| void ExampleMaxUnpooling(Tensor<cpu, 4> &data_src, Tensor<cpu, 4> &data_pooled, |
| Tensor<cpu, 4> &grad_pooled, int ksize, int kstride) { |
| TensorContainer<cpu, 4> grad(data_src.shape_); |
| grad = unpool<red::maximum>(data_src, data_pooled, |
| grad_pooled, ksize, ksize, kstride); |
| } |
| ``` |
| |
| ====== |
| ##### crop |
* ```crop(Expr<xpu, dim> src, Shape<2> oshape, int start_height, int start_width)```
* input parameters:
  - src: input expression
  - oshape: output shape after cropping
  - start_height: starting height offset of the crop
  - start_width: starting width offset of the crop
* can also be called as ```crop(Expr<xpu, dim> src, Shape<2> oshape)```, in which case the crop is taken at the center (see the sketch after the example below)
* return:
  - the cropped expression
| ```c++ |
void ExampleCrop(TensorContainer<cpu, 3> img, int start_height, int start_width) {
  TensorContainer<cpu, 3> cropped(Shape3(img.size(0),
                                         img.size(1) - start_height,
                                         img.size(2) - start_width));
  cropped = crop(img, Shape2(img.size(1) - start_height,
                             img.size(2) - start_width),
                 start_height, start_width);
}
| ``` |
| |
| ====== |
| ##### mirror |
* ```mirror(Expr<xpu, dim> src)```
* input:
  - src: source expression to be mirrored
* output:
  - expression of the mirrored result
| ```c++ |
| void ExampleMirror(TensorContainer<cpu, 3> img) { |
  TensorContainer<cpu, 3> mirrored(img.shape_);
| mirrored = mirror(img); |
| } |
| ``` |
| |