
NNPACK for Multi-Core CPU Support in MXNet

NNPACK is an acceleration package for neural network computations that runs on x86-64, ARMv7, and ARM64 CPUs. With NNPACK, higher-level libraries like MXNet can speed up execution on multi-core CPU machines, including laptops and mobile devices.

MXNet supports NNPACK for forward propagation (inference only) in convolution, max-pooling, and fully-connected layers. This document gives a high-level overview of how to use NNPACK with MXNet.

Conditions

The underlying implementation of NNPACK uses several acceleration methods, including FFT and Winograd. These algorithms work better for some batch size, kernel size, and stride settings than for others, so depending on the context, not all convolution, max-pooling, or fully-connected layers can be powered by NNPACK. When the conditions for running NNPACK are not met, MXNet falls back to the default implementation automatically.

NNPACK only supports Linux and OS X systems. Windows is not supported at present. The following table explains under which conditions NNPACK will work.

operation	conditions
convolution	2d convolution `and` no_bias=False `and` dilate=(1,1) `and` num_group=1 `and` (batch-size = 1 `or` (batch-size > 1 `and` stride = (1,1)))
pooling	max-pooling `and` kernel=(2,2) `and` stride=(2,2) `and` pooling_convention=full
fully-connected	without any restrictions
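
To make the table above easier to apply, here is a minimal sketch of the same conditions as a Python predicate. The function `nnpack_eligible` and its parameter names are hypothetical and not part of MXNet; the arguments simply mirror the layer settings named in the table, and the check assumes a batch size of at least 1.

```python
def nnpack_eligible(op, *, batch_size=1, kernel=None, stride=(1, 1),
                    dilate=(1, 1), num_group=1, no_bias=False,
                    pool_type=None, pooling_convention=None):
    """Return True if a layer matches the conditions in the table above.

    Hypothetical helper for illustration only; MXNet performs its own
    check internally and falls back to the default implementation.
    """
    if op == "fully-connected":
        return True  # the table lists no restrictions

    if op == "convolution":
        return (
            kernel is not None and len(kernel) == 2   # 2d convolution
            and not no_bias                           # no_bias=False
            and tuple(dilate) == (1, 1)
            and num_group == 1
            # batch-size = 1, or batch-size > 1 with stride = (1,1)
            and (batch_size == 1 or tuple(stride) == (1, 1))
        )

    if op == "pooling":
        return (
            pool_type == "max"
            and kernel is not None and tuple(kernel) == (2, 2)
            and tuple(stride) == (2, 2)
            and pooling_convention == "full"
        )

    return False  # any other operation is not covered by the table
```

For example, a 3x3 convolution with stride (2,2) passes the check at batch size 1 but not at batch size 8, which is the case where MXNet would use its default implementation.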

Build/Install NNPACK with MXNet

If your trained model meets the conditions listed above, you can build MXNet with NNPACK support. Follow these steps:

Install NNPACK following its documentation on GitHub. Note that you need ninja to build NNPACK. Make sure to add --enable-shared when running configure.py (i.e. python configure.py --enable-shared), because MXNet links NNPACK dynamically.
Add the NNPACK library path to the environment, e.g. export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$YOUR_NNPACK_INSTALL_PATH/lib
Add the include directories of NNPACK and its third-party dependencies to ADD_CFLAGS in config.mk, e.g. ADD_CFLAGS = -I$(YOUR_NNPACK_INSTALL_PATH)/include -I$(YOUR_NNPACK_INSTALL_PATH)/pthreadpool/include
Set USE_NNPACK = 1 in config.mk.
Build MXNet.
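
Because MXNet links NNPACK dynamically, a missing library path shows up only at load time. The following sketch checks whether a shared library can be found in the directories listed in LD_LIBRARY_PATH. The function `find_shared_lib` is hypothetical, and the file names `libnnpack.so` and `libnnpack.dylib` are assumptions about what the shared build produces; adjust them to match your install.

```python
import os


def find_shared_lib(lib_names=("libnnpack.so", "libnnpack.dylib"),
                    search_path=None):
    """Return full paths of matching library files found on the search path.

    lib_names are assumed file names, not confirmed by the NNPACK docs.
    search_path defaults to the current LD_LIBRARY_PATH value.
    """
    if search_path is None:
        search_path = os.environ.get("LD_LIBRARY_PATH", "")

    hits = []
    for directory in search_path.split(os.pathsep):
        if not directory:
            continue  # skip empty entries such as a leading ':'
        for name in lib_names:
            candidate = os.path.join(directory, name)
            if os.path.isfile(candidate):
                hits.append(candidate)
    return hits
```

An empty result after exporting LD_LIBRARY_PATH suggests that the path or the assumed file name is wrong, which is worth checking before building MXNet.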

NNPACK Performance

Though not all convolution, pooling, and fully-connected layers can make full use of NNPACK, it provides significant speedups for some popular models, including widely used image recognition networks such as AlexNet, VGG, and Inception-BN.

To benchmark NNPACK, we use example/image-classification/benchmark_score.py, modified to cover a wider range of batch sizes. The tests ran on an Intel Xeon E5-2670 CPU with MXNET_CPU_NNPACK_NTHREADS=4.

With MXNet built without NNPACK, the log is:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 6.389429
INFO:root:batch size  2, image/sec: 7.961457
INFO:root:batch size  4, image/sec: 8.950112
INFO:root:batch size  8, image/sec: 9.578176
INFO:root:batch size 16, image/sec: 9.701248
INFO:root:batch size 32, image/sec: 9.839940
INFO:root:batch size 64, image/sec: 10.075369
INFO:root:batch size 128, image/sec: 10.053556
INFO:root:batch size 256, image/sec: 9.972228
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 1.223822
INFO:root:batch size  2, image/sec: 1.322814
INFO:root:batch size  4, image/sec: 1.383586
INFO:root:batch size  8, image/sec: 1.402376
INFO:root:batch size 16, image/sec: 1.415972
INFO:root:batch size 32, image/sec: 1.428377
INFO:root:batch size 64, image/sec: 1.443987
INFO:root:batch size 128, image/sec: 1.427531
INFO:root:batch size 256, image/sec: 1.435279

With MXNet built with NNPACK, the log is:

INFO:root:network: alexnet
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 19.027215
INFO:root:batch size  2, image/sec: 12.879975
INFO:root:batch size  4, image/sec: 17.424076
INFO:root:batch size  8, image/sec: 21.283966
INFO:root:batch size 16, image/sec: 24.469325
INFO:root:batch size 32, image/sec: 25.910348
INFO:root:batch size 64, image/sec: 27.441672
INFO:root:batch size 128, image/sec: 28.009156
INFO:root:batch size 256, image/sec: 28.918950
INFO:root:network: vgg
INFO:root:device: cpu(0)
INFO:root:batch size  1, image/sec: 3.980907
INFO:root:batch size  2, image/sec: 2.392069
INFO:root:batch size  4, image/sec: 3.610553
INFO:root:batch size  8, image/sec: 4.994450
INFO:root:batch size 16, image/sec: 6.396612
INFO:root:batch size 32, image/sec: 7.614288
INFO:root:batch size 64, image/sec: 8.826084
INFO:root:batch size 128, image/sec: 9.193653
INFO:root:batch size 256, image/sec: 9.991472

The results show that NNPACK gives a speedup of about 2x to 7x over the original MXNet CPU implementation, with the largest gains for VGG at large batch sizes.
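
The per-batch-size speedup can be computed directly from the two logs. The sketch below parses log text in the format shown above and divides the NNPACK throughput by the baseline throughput. The function names `parse_throughput` and `speedups` are hypothetical helpers, not part of MXNet or benchmark_score.py.

```python
import re

# Matches lines such as "INFO:root:batch size 16, image/sec: 9.701248"
BATCH_LINE = re.compile(r"batch size\s+(\d+), image/sec: ([\d.]+)")


def parse_throughput(log_text):
    """Map network name -> {batch size: images/sec} from a benchmark log."""
    results = {}
    network = None
    for line in log_text.splitlines():
        if "network:" in line:
            network = line.split("network:")[1].strip()
            results[network] = {}
            continue
        match = BATCH_LINE.search(line)
        if match and network is not None:
            results[network][int(match.group(1))] = float(match.group(2))
    return results


def speedups(baseline_log, nnpack_log):
    """Return network -> {batch size: NNPACK throughput / baseline throughput}."""
    base = parse_throughput(baseline_log)
    fast = parse_throughput(nnpack_log)
    return {
        net: {
            bs: fast[net][bs] / base[net][bs]
            for bs in base[net]
            if bs in fast[net]
        }
        for net in base
        if net in fast
    }
```

Saving each log to a file and passing the two file contents to `speedups` yields one ratio per network and batch size, which makes it easy to see where NNPACK helps most.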

Tips

NNPACK aims to provide high-performance implementations of some layers for multi-core CPUs, and you can set the number of threads it uses through the environment variable MXNET_CPU_NNPACK_NTHREADS. However, we found that performance does not scale proportionally with the number of threads, and we suggest using 4 to 8 threads with NNPACK.
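
One way to try different thread counts is to launch the benchmark in a child process with the variable set. The sketch below only builds the environment; the helper `nnpack_env` is hypothetical, and running the script through a child process is an assumption about workflow rather than a requirement of MXNet.

```python
import os


def nnpack_env(nthreads, base_env=None):
    """Return a copy of the environment with MXNET_CPU_NNPACK_NTHREADS set.

    Hypothetical helper; base_env defaults to the current process environment.
    """
    if nthreads < 1:
        raise ValueError("nthreads must be a positive integer")
    env = dict(os.environ if base_env is None else base_env)
    env["MXNET_CPU_NNPACK_NTHREADS"] = str(nthreads)
    return env


# Example (not executed here): run the benchmark script with 4 threads.
# import subprocess, sys
# subprocess.run(
#     [sys.executable, "example/image-classification/benchmark_score.py"],
#     env=nnpack_env(4),
#     check=True,
# )
```

Building a fresh environment per run keeps the setting out of the parent process, so several thread counts can be compared in one session.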