### NNPACK for Multi-Core CPU Support in MXNet
[NNPACK](https://github.com/Maratyszcza/NNPACK) is an acceleration package
for neural network computations, which can run on x86-64, ARMv7, or ARM64 architecture CPUs.
Using NNPACK, higher-level libraries like _MXNet_ can speed up
the execution on multi-core CPU computers, including laptops and mobile devices.
_MXNet_ supports NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers.
In this document, we give a high level overview of how to use NNPACK with _MXNet_.
### Conditions
The underlying implementation of NNPACK utilizes several acceleration methods,
including [fft](https://arxiv.org/abs/1312.5851) and [winograd](https://arxiv.org/abs/1509.09308).
These algorithms perform better with certain `batch size`, `kernel size`, and `stride` settings than with others,
so depending on the context, not all convolution, max-pooling, or fully-connected layers can be powered by NNPACK.
When the conditions for running NNPACK are not met,
_MXNet_ will fall back to the default implementation automatically.
NNPACK only supports Linux and OS X systems. Windows is not supported at present.
The following table explains under which conditions NNPACK will work.
| operation | conditions |
|:--------- |:---------- |
|convolution |2d convolution `and` no-bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1))) |
|pooling | max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full |
|fully-connected| without any restrictions |
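As an illustration, the conditions table can be encoded as a pair of small predicate functions. These helpers are hypothetical (not part of MXNet's API); they simply mirror the table above so you can check a layer configuration before deciding whether NNPACK will kick in.

```python
def nnpack_supports_convolution(batch_size, stride=(1, 1), dilate=(1, 1),
                                num_group=1, no_bias=False):
    """Mirror the 2d-convolution row of the conditions table above."""
    if no_bias or dilate != (1, 1) or num_group != 1:
        return False
    # batch-size = 1 always qualifies; larger batches also need stride (1,1)
    return batch_size == 1 or stride == (1, 1)

def nnpack_supports_pooling(pool_type, kernel, stride, pooling_convention):
    """Mirror the pooling row: max-pooling, 2x2 kernel and stride, 'full'."""
    return (pool_type == "max" and kernel == (2, 2)
            and stride == (2, 2) and pooling_convention == "full")
```

For example, a convolution with batch size 8 and stride (2,2) falls back to the default implementation, while the same layer with stride (1,1) can use NNPACK.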
### Build/Install NNPACK with MXNet
If the trained model meets the conditions described above,
you can build MXNet with NNPACK support.
Follow these simple steps:
* Build NNPACK shared library with the following commands. _MXNet_ will link NNPACK dynamically.
Note: The following NNPACK installation instructions have been tested on Ubuntu 14.04 and 16.04.
```bash
# Install Pip
$ sudo apt-get update
$ sudo apt-get install -y python-pip
$ sudo pip install --upgrade pip
# Install PeachPy
$ git clone https://github.com/Maratyszcza/PeachPy.git
$ cd PeachPy
$ sudo pip install --upgrade -r requirements.txt
$ python setup.py generate
$ sudo pip install --upgrade .
# Install Ninja Build System
$ sudo apt-get install ninja-build
$ pip install ninja-syntax
# Build NNPACK shared library
$ cd ~
$ git clone --recursive https://github.com/Maratyszcza/NNPACK.git
$ cd NNPACK
# The latest NNPACK does not support building it as a shared library
# with the --enable-shared flag, so reset to a commit that supports it.
$ git reset --hard 9c6747d7b80051b40e6f92d6828e2ed997529cd2
$ git submodule init && git submodule update --recursive
$ python ./configure.py --enable-shared
$ ninja
$ cd ~
```
* Add the NNPACK library path to the `LD_LIBRARY_PATH` environment variable, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib`
* Add the include directories of NNPACK and its pthreadpool dependency to `ADD_CFLAGS` in config.mk, e.g. `ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include/ -I$(YOUR_NNPACK_INSTALL_PATH)/third-party/pthreadpool/include/`
* Set `USE_NNPACK = 1` in config.mk.
* Build MXNet from source following the [install guide](http://mxnet.io/get_started/install.html).
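Putting the last three steps together, the relevant part of config.mk might look like the fragment below. `NNPACK_ROOT` is an illustrative variable standing in for wherever you cloned NNPACK; adjust the paths to your checkout.

```makefile
# config.mk fragment (NNPACK_ROOT is a placeholder; point it at your NNPACK checkout)
NNPACK_ROOT = $(HOME)/NNPACK
ADD_CFLAGS += -I$(NNPACK_ROOT)/include/ -I$(NNPACK_ROOT)/third-party/pthreadpool/include/
USE_NNPACK = 1
```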
### NNPACK Performance
Though not all convolutional, pooling, and fully-connected layers can make full use of NNPACK,
for some popular models it provides significant speedups, including the widely used image-recognition networks AlexNet, VGG, and Inception-BN.
To benchmark NNPACK, we use `example/image-classification/benchmark_score.py` (modified to cover a wider range of batch sizes) on an Intel Xeon E5-2670 CPU with `MXNET_CPU_NNPACK_NTHREADS=4`.
With MXNet built without NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 6.389429
INFO:root:batch size 2, image/sec: 7.961457
INFO:root:batch size 4, image/sec: 8.950112
INFO:root:batch size 8, image/sec: 9.578176
INFO:root:batch size 16, image/sec: 9.701248
INFO:root:batch size 32, image/sec: 9.839940
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 1.223822
INFO:root:batch size 2, image/sec: 1.322814
INFO:root:batch size 4, image/sec: 1.383586
INFO:root:batch size 8, image/sec: 1.402376
INFO:root:batch size 16, image/sec: 1.415972
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.443987
INFO:root:batch size 128, image/sec: 1.427531
INFO:root:batch size 256, image/sec: 1.435279
```
With MXNet built with NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 19.027215
INFO:root:batch size 2, image/sec: 12.879975
INFO:root:batch size 4, image/sec: 17.424076
INFO:root:batch size 8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 3.980907
INFO:root:batch size 2, image/sec: 2.392069
INFO:root:batch size 4, image/sec: 3.610553
INFO:root:batch size 8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472
```
The results show that NNPACK can deliver a speedup of roughly 2x to 7x over the default _MXNet_ CPU implementation.
### Tips
NNPACK aims to provide high-performance implementations of some layers for multi-core CPUs. You can set the number of threads it uses with the environment variable `MXNET_CPU_NNPACK_NTHREADS`. However, we found that performance does not scale linearly with the number of threads, so we suggest using 4 to 8 threads with NNPACK.
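For example, the thread count can be set from Python before MXNet is imported (the variable name is real; the value 4 follows the suggestion above):

```python
import os

# MXNET_CPU_NNPACK_NTHREADS is read when the MXNet engine starts,
# so set it before importing mxnet; 4 threads per the tip above.
os.environ["MXNET_CPU_NNPACK_NTHREADS"] = "4"
```

Alternatively, export the variable in your shell (`export MXNET_CPU_NNPACK_NTHREADS=4`) before launching the process.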