### NNPACK for Multi-Core CPU Support in MXNet
[NNPACK](https://github.com/Maratyszcza/NNPACK) is an acceleration package
for neural network computations, which can run on x86-64, ARMv7, or ARM64 architecture CPUs.
Using NNPACK, higher-level libraries like _MXNet_ can speed up
execution on multi-core CPUs, including those in laptops and mobile devices.
_MXNet_ supports NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers.
This document gives a high-level overview of how to use NNPACK with _MXNet_.
### Conditions
The underlying implementation of NNPACK uses several acceleration methods,
including [FFT](https://arxiv.org/abs/1312.5851) and [Winograd](https://arxiv.org/abs/1509.09308).
These algorithms work better with certain `batch size`, `kernel size`, and `stride` settings than with others,
so, depending on the context, not all convolution, max-pooling, or fully-connected layers can be accelerated by NNPACK.
When the conditions for running NNPACK are not met,
_MXNet_ will fall back to the default implementation automatically.
NNPACK only supports Linux and OS X systems. Windows is not supported at present.
The following table explains under which conditions NNPACK will work.
| operation | conditions |
|:--------- |:---------- |
|convolution | 2d convolution `and` no_bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1))) |
|pooling | max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full |
|fully-connected| without any restrictions |
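As an illustration, the sketch below (with hypothetical layer sizes) constructs layers that satisfy these conditions using the MXNet Python symbol API:

```python
import mxnet as mx

data = mx.sym.Variable('data')

# 2-D convolution meeting the NNPACK conditions:
# stride=(1,1), dilate=(1,1), num_group=1, and a bias term (no_bias=False).
conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3),
                          stride=(1, 1), dilate=(1, 1), num_group=1,
                          no_bias=False, name='conv1')

# Max-pooling with kernel=(2,2), stride=(2,2), and pooling_convention='full'.
pool = mx.sym.Pooling(data=conv, pool_type='max', kernel=(2, 2),
                      stride=(2, 2), pooling_convention='full', name='pool1')

# Fully-connected layers are accelerated without further restrictions.
fc = mx.sym.FullyConnected(data=mx.sym.Flatten(data=pool), num_hidden=10, name='fc1')
```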
### Build/Install NNPACK with MXNet
If your trained model meets the conditions for using NNPACK,
you can build MXNet with NNPACK support.
Follow these steps:
* Install NNPACK following the instructions on [GitHub](https://github.com/Maratyszcza/NNPACK#building). Note that you need Ninja to build NNPACK. Make sure to add `--enable-shared` when running configure.py (i.e. `python configure.py --enable-shared`), because _MXNet_ links NNPACK dynamically.
* Add the NNPACK library path to the `LD_LIBRARY_PATH` environment variable, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib`
* Add the NNPACK and pthreadpool include directories to `ADD_CFLAGS` in `config.mk`, e.g. `ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include -I$(YOUR_NNPACK_INSTALL_PATH)/pthreadpool/include`
* Set `USE_NNPACK = 1` in `config.mk`.
* [Build MXNet](http://mxnet.io/get_started/setup.html#overview).
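After building, a quick way to check that the NNPACK-enabled build runs is a single inference-only forward pass through a convolution that meets the conditions above. This is only a smoke test (a sketch with hypothetical shapes): MXNet falls back to its default implementation silently when NNPACK cannot be used, so a successful run does not by itself prove NNPACK was invoked.

```python
import numpy as np
import mxnet as mx

# A 2-D convolution that satisfies the NNPACK conditions
# (stride=(1,1), dilate=(1,1), num_group=1, bias enabled).
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, num_filter=32, kernel=(3, 3),
                         stride=(1, 1), no_bias=False, name='conv')

# Bind on the CPU and run one inference-only forward pass,
# which is the path NNPACK accelerates in MXNet.
exe = net.simple_bind(ctx=mx.cpu(), data=(1, 3, 224, 224))
x = mx.nd.array(np.random.rand(1, 3, 224, 224).astype(np.float32))
exe.forward(is_train=False, data=x)
print(exe.outputs[0].shape)
```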
### NNPACK Performance
Though not all convolution, pooling, and fully-connected layers can make full use of NNPACK,
it provides significant speedups for some popular models, including widely used image-recognition networks such as AlexNet, VGG, and Inception-BN.
To benchmark NNPACK, we used `example/image-classification/benchmark_score.py` (modified to cover a wider range of batch sizes) on an Intel Xeon E5-2670 CPU with `MXNET_CPU_NNPACK_NTHREADS=4`.
With MXNet built without NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 6.389429
INFO:root:batch size 2, image/sec: 7.961457
INFO:root:batch size 4, image/sec: 8.950112
INFO:root:batch size 8, image/sec: 9.578176
INFO:root:batch size 16, image/sec: 9.701248
INFO:root:batch size 32, image/sec: 9.839940
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 1.223822
INFO:root:batch size 2, image/sec: 1.322814
INFO:root:batch size 4, image/sec: 1.383586
INFO:root:batch size 8, image/sec: 1.402376
INFO:root:batch size 16, image/sec: 1.415972
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.443987
INFO:root:batch size 128, image/sec: 1.427531
INFO:root:batch size 256, image/sec: 1.435279
```
With MXNet built with NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 19.027215
INFO:root:batch size 2, image/sec: 12.879975
INFO:root:batch size 4, image/sec: 17.424076
INFO:root:batch size 8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 3.980907
INFO:root:batch size 2, image/sec: 2.392069
INFO:root:batch size 4, image/sec: 3.610553
INFO:root:batch size 8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472
```
The results show that NNPACK delivers a speedup of roughly 2x to 7x over the default _MXNet_ CPU implementation.
### Tips
NNPACK aims to provide high-performance implementations of certain layers for multi-core CPUs. You can set the number of threads through the environment variable `MXNET_CPU_NNPACK_NTHREADS`. However, we found that performance does not scale proportionally with the number of threads, so we suggest using 4 to 8 threads when using NNPACK.
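For example, in a Python script one way to apply this (a minimal sketch; exporting the variable in your shell before launching works just as well) is to set it before `mxnet` is imported:

```python
import os

# Set before importing mxnet so the value is visible when the library
# initializes its NNPACK backend (assumption: the variable is read at startup).
os.environ['MXNET_CPU_NNPACK_NTHREADS'] = '4'

import mxnet as mx

# ...build and run your model on mx.cpu() as usual...
```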