### NNPACK for Multi-Core CPU Support in MXNet
[NNPACK](https://github.com/Maratyszcza/NNPACK) is an acceleration package
for neural network computations, which can run on x86-64, ARMv7, or ARM64 architecture CPUs.
Using NNPACK, higher-level libraries like _MXNet_ can speed up
execution on multi-core CPUs, including those in laptops and mobile devices.
_MXNet_ supports NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers.
This document gives a high-level overview of how to use NNPACK with _MXNet_.
### Conditions
The underlying implementation of NNPACK uses several acceleration methods,
including [FFT](https://arxiv.org/abs/1312.5851) and [Winograd](https://arxiv.org/abs/1509.09308).
These algorithms work better with certain `batch size`, `kernel size`, and `stride` settings than with others,
so, depending on the context, not all convolution, max-pooling, or fully-connected layers can be accelerated by NNPACK.
When the conditions for running NNPACK are not met,
_MXNet_ will fall back to the default implementation automatically.
NNPACK only supports Linux and OS X systems. Windows is not supported at present.
The following table explains under which conditions NNPACK will work.
| operation | conditions |
|:--------- |:---------- |
|convolution | 2d convolution `and` no_bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1))) |
|pooling | max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full |
|fully-connected| without any restrictions |
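As an illustration, the sketch below (with hypothetical layer sizes) constructs layers that satisfy these conditions using the MXNet Python symbol API:

```python
import mxnet as mx

data = mx.sym.Variable('data')

# 2-D convolution meeting the NNPACK conditions:
# stride=(1,1), dilate=(1,1), num_group=1, and a bias term (no_bias=False).
conv = mx.sym.Convolution(data=data, num_filter=64, kernel=(3, 3),
                          stride=(1, 1), dilate=(1, 1), num_group=1,
                          no_bias=False, name='conv1')

# Max-pooling with kernel=(2,2), stride=(2,2), and pooling_convention='full'.
pool = mx.sym.Pooling(data=conv, pool_type='max', kernel=(2, 2),
                      stride=(2, 2), pooling_convention='full', name='pool1')

# Fully-connected layers are accelerated without further restrictions.
fc = mx.sym.FullyConnected(data=mx.sym.Flatten(data=pool), num_hidden=10, name='fc1')
```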
### Build/Install NNPACK with MXNet
If your trained model meets the conditions for using NNPACK,
you can build MXNet with NNPACK support.
Follow these steps:
* Install NNPACK following the instructions on [GitHub](https://github.com/Maratyszcza/NNPACK#building). Note that you need Ninja to build NNPACK. Make sure to add `--enable-shared` when running configure.py (i.e. `python configure.py --enable-shared`), because _MXNet_ links NNPACK dynamically.
* Add the NNPACK library path to the `LD_LIBRARY_PATH` environment variable, e.g. `export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib`
* Add the NNPACK and pthreadpool include directories to `ADD_CFLAGS` in `config.mk`, e.g. `ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include -I$(YOUR_NNPACK_INSTALL_PATH)/pthreadpool/include`
* Set `USE_NNPACK = 1` in `config.mk`.
* [Build MXNet](http://mxnet.io/get_started/setup.html#overview).
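After building, a quick way to check that the NNPACK-enabled build runs is a single inference-only forward pass through a convolution that meets the conditions above. This is only a smoke test (a sketch with hypothetical shapes): MXNet falls back to its default implementation silently when NNPACK cannot be used, so a successful run does not by itself prove NNPACK was invoked.

```python
import numpy as np
import mxnet as mx

# A 2-D convolution that satisfies the NNPACK conditions
# (stride=(1,1), dilate=(1,1), num_group=1, bias enabled).
data = mx.sym.Variable('data')
net = mx.sym.Convolution(data=data, num_filter=32, kernel=(3, 3),
                         stride=(1, 1), no_bias=False, name='conv')

# Bind on the CPU and run one inference-only forward pass,
# which is the path NNPACK accelerates in MXNet.
exe = net.simple_bind(ctx=mx.cpu(), data=(1, 3, 224, 224))
x = mx.nd.array(np.random.rand(1, 3, 224, 224).astype(np.float32))
exe.forward(is_train=False, data=x)
print(exe.outputs[0].shape)
```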
### NNPACK Performance
Though not all convolution, pooling, and fully-connected layers can make full use of NNPACK,
it provides significant speedups for some popular models, including widely used image-recognition networks such as AlexNet, VGG, and Inception-BN.
To benchmark NNPACK, we used `example/image-classification/benchmark_score.py` (modified to cover a wider range of batch sizes) on an Intel Xeon E5-2670 CPU with `MXNET_CPU_NNPACK_NTHREADS=4`.
With MXNet built without NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 6.389429
INFO:root:batch size 2, image/sec: 7.961457
INFO:root:batch size 4, image/sec: 8.950112
INFO:root:batch size 8, image/sec: 9.578176
INFO:root:batch size 16, image/sec: 9.701248
INFO:root:batch size 32, image/sec: 9.839940
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 1.223822
INFO:root:batch size 2, image/sec: 1.322814
INFO:root:batch size 4, image/sec: 1.383586
INFO:root:batch size 8, image/sec: 1.402376
INFO:root:batch size 16, image/sec: 1.415972
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.443987
INFO:root:batch size 128, image/sec: 1.427531
INFO:root:batch size 256, image/sec: 1.435279
```
With MXNet built with NNPACK, the log is:
```
INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 19.027215
INFO:root:batch size 2, image/sec: 12.879975
INFO:root:batch size 4, image/sec: 17.424076
INFO:root:batch size 8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size 1, image/sec: 3.980907
INFO:root:batch size 2, image/sec: 2.392069
INFO:root:batch size 4, image/sec: 3.610553
INFO:root:batch size 8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472
```
The results show that NNPACK delivers a speedup of roughly 2x to 7x over the default _MXNet_ CPU implementation.
### Tips
NNPACK aims to provide high-performance implementations of certain layers for multi-core CPUs. You can set the number of threads through the environment variable `MXNET_CPU_NNPACK_NTHREADS`. However, we found that performance does not scale proportionally with the number of threads, so we suggest using 4 to 8 threads when using NNPACK.
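For example, in a Python script one way to apply this (a minimal sketch; exporting the variable in your shell before launching works just as well) is to set it before `mxnet` is imported:

```python
import os

# Set before importing mxnet so the value is visible when the library
# initializes its NNPACK backend (assumption: the variable is read at startup).
os.environ['MXNET_CPU_NNPACK_NTHREADS'] = '4'

import mxnet as mx

# ...build and run your model on mx.cpu() as usual...
```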