### NNPACK for Multi-Core CPU Support in MXNet
[NNPACK](https://github.com/Maratyszcza/NNPACK) is an acceleration package
for neural network computations that runs on x86-64, ARMv7, and ARM64 CPUs.
With NNPACK, higher-level libraries such as _MXNet_ can speed up
execution on multi-core CPU machines, including laptops and mobile devices.

_MXNet_ supports NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers.
This document gives a high-level overview of how to use NNPACK with _MXNet_.

### Conditions
The underlying implementation of NNPACK uses several acceleration methods,
including [FFT](https://arxiv.org/abs/1312.5851) and [Winograd](https://arxiv.org/abs/1509.09308).
These algorithms work better for some `batch size`, `kernel size`, and `stride` settings than for others,
so depending on the context, not all convolution, max-pooling, or fully-connected layers can be accelerated by NNPACK.
When the conditions for running NNPACK are not met,
_MXNet_ automatically falls back to its default implementation.

NNPACK only supports Linux and OS X systems. Windows is not supported at present.
The following table lists the conditions under which NNPACK is used.

| operation       | conditions |
|:---------       |:---------- |
| convolution     | 2D convolution `and` no_bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1))) |
| pooling         | max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full |
| fully-connected | no restrictions |
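
The table can also be read as a pair of predicates. The following is a minimal Python sketch, not MXNet code: the helper names `conv_can_use_nnpack` and `pooling_can_use_nnpack` are hypothetical, and their arguments simply mirror the parameter names used in the table. MXNet performs this check internally and falls back on its own, so the sketch is only meant for reasoning about whether a given layer qualifies.

```python
def conv_can_use_nnpack(kernel, stride, dilate, num_group, no_bias, batch_size):
    """Mirror the convolution row of the table (hypothetical helper)."""
    return (
        len(kernel) == 2                 # 2D convolution
        and not no_bias                  # no_bias=False: a bias term is present
        and tuple(dilate) == (1, 1)
        and num_group == 1
        and (
            batch_size == 1
            or (batch_size > 1 and tuple(stride) == (1, 1))
        )
    )


def pooling_can_use_nnpack(pool_type, kernel, stride, pooling_convention):
    """Mirror the pooling row of the table (hypothetical helper)."""
    return (
        pool_type == "max"
        and tuple(kernel) == (2, 2)
        and tuple(stride) == (2, 2)
        and pooling_convention == "full"
    )


# A 3x3 convolution with stride 1 qualifies at any batch size.
print(conv_can_use_nnpack((3, 3), (1, 1), (1, 1), 1, False, 32))    # True
# A strided convolution qualifies only when the batch size is 1.
print(conv_can_use_nnpack((11, 11), (4, 4), (1, 1), 1, False, 32))  # False
print(conv_can_use_nnpack((11, 11), (4, 4), (1, 1), 1, False, 1))   # True
```

Fully-connected layers need no check, since the table places no restrictions on them.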

### Build/Install NNPACK with MXNet

If your trained model meets the conditions above,
you can build MXNet with NNPACK support.
Follow these steps:
* Build the NNPACK shared library with the following commands. _MXNet_ links to NNPACK dynamically.

Note: The following NNPACK installation instructions have been tested on Ubuntu 14.04 and 16.04.

```bash
# Install pip
$ sudo apt-get update
$ sudo apt-get install -y python-pip
$ sudo pip install --upgrade pip

# Install PeachPy
$ git clone https://github.com/Maratyszcza/PeachPy.git
$ cd PeachPy
$ sudo pip install --upgrade -r requirements.txt
$ python setup.py generate
$ sudo pip install --upgrade .

# Install the Ninja build system
$ sudo apt-get install ninja-build
$ pip install ninja-syntax

# Build the NNPACK shared library
$ cd ~
$ git clone --recursive https://github.com/Maratyszcza/NNPACK.git
$ cd NNPACK
# The latest NNPACK does not support building a shared library with the --enable-shared flag.
# Reset to a commit that supports it.
$ git reset --hard 9c6747d7b80051b40e6f92d6828e2ed997529cd2
$ git submodule init && git submodule update --recursive
$ python ./configure.py --enable-shared
$ ninja
$ cd ~
```

* Add the NNPACK library path to the `LD_LIBRARY_PATH` environment variable, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib`
* Add the include directories of NNPACK and its third-party dependencies to `ADD_CFLAGS` in config.mk, e.g. `ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include/ -I$(YOUR_NNPACK_INSTALL_PATH)/third-party/pthreadpool/include/`
* Set `USE_NNPACK = 1` in config.mk.
* Build MXNet from source following the [install guide](http://mxnet.io/get_started/install.html).

### NNPACK Performance

Although not every convolution, pooling, or fully-connected layer can take full advantage of NNPACK,
it provides significant speedups for some popular models. These include widely used image recognition networks such as AlexNet, VGG, and Inception-BN.

To benchmark NNPACK, we use `example/image-classification/benchmark_score.py` (modified to cover a wider range of batch sizes). The tests ran on an E5-2670 CPU with `MXNET_CPU_NNPACK_NTHREADS=4`.

With MXNet built without NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 6.389429
INFO:root:batch size  2, image/sec: 7.961457
INFO:root:batch size  4, image/sec: 8.950112
INFO:root:batch size  8, image/sec: 9.578176
INFO:root:batch size 16, image/sec: 9.701248
INFO:root:batch size 32, image/sec: 9.839940
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 1.223822
INFO:root:batch size  2, image/sec: 1.322814
INFO:root:batch size  4, image/sec: 1.383586
INFO:root:batch size  8, image/sec: 1.402376
INFO:root:batch size 16, image/sec: 1.415972
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.443987
INFO:root:batch size 128, image/sec: 1.427531
INFO:root:batch size 256, image/sec: 1.435279
```

With MXNet built with NNPACK, the log is:

```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 19.027215
INFO:root:batch size  2, image/sec: 12.879975
INFO:root:batch size  4, image/sec: 17.424076
INFO:root:batch size  8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 3.980907
INFO:root:batch size  2, image/sec: 2.392069
INFO:root:batch size  4, image/sec: 3.610553
INFO:root:batch size  8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472
```
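
The results show that NNPACK delivers a speedup of roughly 1.6x to 7x over the original _MXNet_ CPU implementation, depending on the network and batch size. The smallest gain is for AlexNet at batch size 2 and the largest is for VGG at batch size 256.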
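
To see where the speedup range comes from, the per-batch-size ratios can be computed directly from the two logs. The following is a minimal sketch, assuming the logs are available as text in the format shown above. `parse_scores` and `speedups` are hypothetical helper names, not part of `benchmark_score.py`, and the sample strings are a two-line excerpt of the AlexNet results.

```python
import re

# Matches lines such as "INFO:root:batch size  1, image/sec: 6.389429".
_SCORE = re.compile(r"batch size\s+(\d+), image/sec: ([0-9.]+)")
# Matches lines such as "INFO:root:network: alexnet".
_NETWORK = re.compile(r"network: (\S+)")


def parse_scores(log_text):
    """Return {(network, batch_size): images_per_sec} from a benchmark log."""
    scores = {}
    network = None
    for line in log_text.splitlines():
        net = _NETWORK.search(line)
        if net:
            network = net.group(1)
            continue
        score = _SCORE.search(line)
        if score:
            scores[(network, int(score.group(1)))] = float(score.group(2))
    return scores


def speedups(baseline_log, nnpack_log):
    """Return {(network, batch_size): nnpack_rate / baseline_rate}."""
    base = parse_scores(baseline_log)
    fast = parse_scores(nnpack_log)
    return {key: fast[key] / base[key] for key in base if key in fast}


# Excerpts copied from the two logs above.
baseline = """INFO:root:network: alexnet
INFO:root:batch size  1, image/sec: 6.389429
INFO:root:batch size  2, image/sec: 7.961457"""
with_nnpack = """INFO:root:network: alexnet
INFO:root:batch size  1, image/sec: 19.027215
INFO:root:batch size  2, image/sec: 12.879975"""

for (net, batch), ratio in sorted(speedups(baseline, with_nnpack).items()):
    print("%s batch %d: %.2fx" % (net, batch, ratio))
```

Feeding in the complete logs instead of the excerpts gives the ratio for every network and batch size.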

### Tips

NNPACK aims to provide high-performance implementations of some layers for multi-core CPUs. You can set the number of threads through the environment variable `MXNET_CPU_NNPACK_NTHREADS`. However, we found that performance does not scale proportionally with the number of threads, so we suggest using 4 to 8 threads with NNPACK.
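
For example, the thread count can be set from Python before MXNet is loaded. This is a minimal sketch under one assumption: that the variable is read when MXNet initializes, so it has to be in the environment before the first `import mxnet`. The helper `set_nnpack_threads` is hypothetical. Exporting the variable in the shell before launching the script should have the same effect.

```python
import os


def set_nnpack_threads(num_threads):
    """Export MXNET_CPU_NNPACK_NTHREADS for this process (hypothetical helper)."""
    num_threads = int(num_threads)
    if num_threads < 1:
        raise ValueError("num_threads must be a positive integer")
    # Environment variable values are strings, so convert explicitly.
    os.environ["MXNET_CPU_NNPACK_NTHREADS"] = str(num_threads)


# 4 is the lower end of the suggested range of 4 to 8 threads.
set_nnpack_threads(4)

# Assumption: MXNet reads the variable at start-up, so import it afterwards.
# import mxnet as mx
```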