Better training and inference performance can be achieved on Intel-Architecture CPUs with MXNet built with oneDNN on multiple operating systems, including Linux, Windows, and macOS. In the following sections, you will find build instructions for MXNet with oneDNN on Linux, macOS, and Windows.
Detailed performance data collected on Intel Xeon CPUs with MXNet built with oneDNN can be found here.
```bash
sudo apt-get update
sudo apt-get install -y build-essential git
sudo apt-get install -y libopenblas-dev liblapack-dev
sudo apt-get install -y libopencv-dev
sudo apt-get install -y graphviz
```
```bash
git clone --recursive https://github.com/apache/mxnet.git
cd mxnet
```
To achieve better performance, Intel OpenMP or LLVM OpenMP is recommended, as shown in the instructions below. Otherwise, the default GNU OpenMP will be used and you may get sub-optimal performance. If you don't have the full MKL library installed, you can use OpenBLAS as the BLAS library by setting USE_BLAS=Open.
```bash
# build with llvm OpenMP and Intel MKL/OpenBlas
mkdir build && cd build
cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_OPENMP=ON -DUSE_OPENCV=ON ..
make -j $(nproc)
```
```bash
# build with Intel MKL and Intel OpenMP
mkdir build && cd build
cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=mkl ..
make -j $(nproc)
```
```bash
# build with openblas and GNU OpenMP (sub-optimal performance)
mkdir build && cd build
cmake -DUSE_CUDA=OFF -DUSE_ONEDNN=ON -DUSE_BLAS=Open ..
make -j $(nproc)
```
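After the build completes and the Python bindings are installed, you can optionally check that the library was compiled with oneDNN support by inspecting MXNet's runtime feature flags. This is a minimal sketch; depending on the MXNet version, the flag may be reported as MKLDNN or ONEDNN.

```python
# Check whether this MXNet build has oneDNN support enabled.
from mxnet.runtime import Features

features = Features()
print(features)  # prints all compile-time feature flags

# The oneDNN flag is reported as MKLDNN in older releases and ONEDNN in newer ones.
flag = 'ONEDNN' if 'ONEDNN' in features else 'MKLDNN'
print('oneDNN enabled:', features.is_enabled(flag))
```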
Install the dependencies required for MXNet with the following commands:
```bash
# Paste this command in Mac terminal to install Homebrew
/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

# install dependency
brew update
brew install pkg-config
brew install graphviz
brew tap homebrew/core
brew install opencv
brew tap homebrew/versions
brew install llvm
```
```bash
git clone --recursive https://github.com/apache/mxnet.git
cd mxnet
```
```bash
LIBRARY_PATH=$(brew --prefix llvm)/lib/ make -j $(sysctl -n hw.ncpu) \
    CC=$(brew --prefix llvm)/bin/clang \
    CXX=$(brew --prefix llvm)/bin/clang++ \
    USE_OPENCV=1 \
    USE_OPENMP=1 \
    USE_ONEDNN=1 \
    USE_BLAS=apple
```
On Windows, you can use Microsoft Visual Studio 2015 or Microsoft Visual Studio 2017 to compile MXNet with oneDNN. Microsoft Visual Studio 2015 is recommended.
Visual Studio 2015
To build and install MXNet yourself, you need the following dependencies. Install the required dependencies:

- Set the environment variable OpenCV_DIR to point to the OpenCV build directory (e.g., OpenCV_DIR = C:\opencv\build). Also, add the OpenCV bin directory (C:\opencv\build\x64\vc14\bin for example) to the PATH variable.
- Set the MKLROOT environment variable to point to the MKL directory that contains the include and lib folders. If you want to use MKL BLAS, you should set -DUSE_BLAS=mkl when running cmake. Typically, you can find the directory in C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl.
- If you use OpenBLAS instead, download mingw64.dll.zip along with OpenBLAS and add them to PATH. Set the environment variable OpenBLAS_HOME to point to the OpenBLAS directory that contains the include and lib directories. Typically, you can find the directory in C:\Downloads\OpenBLAS\.

After you have installed all of the required dependencies, build the MXNet source code:
```
git clone --recursive https://github.com/apache/mxnet.git
cd C:\mxnet
```
Create a ./build directory. Make sure to specify the architecture in the cmake command:

```
>mkdir build
>cd build
>cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=Open -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release
```
>"C:\Program Files (x86)\IntelSWTools\compilers_and_libraries\windows\mkl\bin\mklvars.bat" intel64 >cmake -G "Visual Studio 14 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release
After CMake succeeds, open the generated .sln solution file in Visual Studio and compile it, or compile the MXNet source code with the following command:

```
msbuild mxnet.sln /p:Configuration=Release;Platform=x64 /maxcpucount
```
These commands produce an MXNet library called libmxnet.dll in the ./build/Release/ or ./build/Debug folder. Also, libmkldnn.dll will be in the ./build/3rdparty/onednn/src/Release/ folder.

Make sure that all the DLL files used above (such as libmkldnn.dll, libmklml*.dll, libiomp5.dll, libopenblas*.dll, etc.) are added to the system PATH. For convenience, you can put all of them in \windows\system32. Otherwise, you will come across Not Found Dependencies errors when loading MXNet.

Visual Studio 2017
Users can follow the same steps as for Visual Studio 2015 to build MXNet with oneDNN, but need to change the version-related settings, for example, the OpenCV bin directory becomes C:\opencv\build\x64\vc15\bin and the build command is as below:
>cmake -G "Visual Studio 15 Win64" .. -DUSE_CUDA=0 -DUSE_CUDNN=0 -DUSE_NVRTC=0 -DUSE_OPENCV=1 -DUSE_OPENMP=1 -DUSE_PROFILER=1 -DUSE_BLAS=mkl -DUSE_LAPACK=1 -DUSE_DIST_KVSTORE=0 -DCUDA_ARCH_NAME=All -DUSE_ONEDNN=1 -DCMAKE_BUILD_TYPE=Release
Preinstall Python and some dependent modules:
```
pip install numpy graphviz
set PYTHONPATH=[workdir]\mxnet\python
```
or install mxnet and verify the installation:
```
cd python
sudo python setup.py install
python -c "import mxnet as mx;print((mx.nd.ones((2, 3))*2).asnumpy());"
```
Expected Output:
```
[[ 2.  2.  2.]
 [ 2.  2.  2.]]
```
After MXNet is installed, you can verify that the oneDNN backend works well with a single convolution layer.
```python
from mxnet import np
from mxnet.gluon import nn

num_filter = 32
kernel = (3, 3)
pad = (1, 1)
shape = (32, 32, 256, 256)

conv_layer = nn.Conv2D(channels=num_filter, kernel_size=kernel, padding=pad)
conv_layer.initialize()

data = np.random.normal(size=shape)
o = conv_layer(data)
print(o)
```
More detailed debugging and profiling information can be logged by setting the environment variable ‘DNNL_VERBOSE’:
```bash
export DNNL_VERBOSE=1
```
For example, by running the above code snippet, the following debugging log provides more insight into the oneDNN primitives convolution and reorder, including the memory layouts, inferred shapes, and the execution time of each primitive.
```
dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Intel AVX-512 with Intel DL Boost
dnnl_verbose,info,gpu,runtime:none
dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x32x256x256,8.34912
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb32a:f0,,,32x32x3x3,0.0229492
dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb32a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic32oc32_ih256oh256kh3sh1dh0ph1_iw256ow256kw3sw1dw0pw1,10.5898
```
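Alternatively, if you prefer to enable the verbose output from inside a Python script rather than the shell, you can set the variable with os.environ before importing mxnet. This is a minimal sketch, equivalent to the export command above.

```python
import os

# Set DNNL_VERBOSE before importing mxnet so oneDNN picks it up when the library loads.
os.environ['DNNL_VERBOSE'] = '1'

import mxnet as mx  # imported after setting the environment variable on purpose
# ... run the convolution snippet from above; oneDNN verbose logs will be printed.
```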
You can find step-by-step guidance to do profiling for oneDNN primitives in Profiling oneDNN Operators.
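As a quick illustration (a minimal sketch using MXNet's built-in profiler, not a replacement for the linked guide), you can wrap the convolution call from the earlier snippet with profiler start/stop calls and dump the aggregated operator statistics:

```python
import mxnet as mx
from mxnet import np
from mxnet.gluon import nn

# Configure the profiler before running the workload.
mx.profiler.set_config(profile_all=True, aggregate_stats=True, filename='conv_profile.json')

conv_layer = nn.Conv2D(channels=32, kernel_size=(3, 3), padding=(1, 1))
conv_layer.initialize()
data = np.random.normal(size=(32, 32, 256, 256))

mx.profiler.set_state('run')
out = conv_layer(data)
mx.nd.waitall()              # wait for asynchronous execution to finish before stopping
mx.profiler.set_state('stop')

print(mx.profiler.dumps())   # aggregated statistics, including oneDNN-backed operators
```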
With MKL BLAS, performance is expected to improve further, by an amount that varies with the computational load of the model. Upon accepting the Intel Simplified Software License, you can redistribute not only dynamic libraries but also headers, examples, and static libraries. Installing the full MKL package enables MKL support for all operators under the linalg namespace.
Download and install the latest full MKL version following the instructions on the Intel website. You can also install MKL through the YUM or APT repositories.
1. Create and navigate to the build directory: `mkdir build && cd build`
2. Run `cmake -DUSE_CUDA=OFF -DUSE_BLAS=mkl ..`
3. Run `make -j`
4. Navigate into the python directory
5. Run `sudo python setup.py install`
After MXNet is installed, you can verify that MKL BLAS works well with a linear matrix solver.
```python
from mxnet import np

coeff = np.array([[7, 0], [5, 2]])
y = np.array([14, 18])
x = np.linalg.solve(coeff, y)
print(x)
```
You can get verbose log output from the MKL library by setting the environment variable:
```bash
export MKL_VERBOSE=1
```
Then, by running the above code snippet, you should get output similar to the message below (the SGESV primitive from MKL was executed). Layout information and primitive execution performance are also shown in the log message.
```
mkl-service + Intel(R) MKL: THREADING LAYER: (null)
mkl-service + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
mkl-service + Intel(R) MKL: preloading libiomp5.so runtime
Intel(R) MKL 2020.0 Update 1 Product build 20200208 for Intel(R) 64 architecture Intel(R) Advanced Vector Extensions 512 (Intel(R) AVX-512) with support of Vector Neural Network Instructions enabled processors, Lnx 2.70GHz lp64 intel_thread
MKL_VERBOSE SGESV(2,1,0x7f74d4002780,2,0x7f74d4002798,0x7f74d4002790,2,0) 77.58us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
```
To better utilize the potential of oneDNN, using graph optimizations is recommended. There are a few limitations of this feature:
If your use case meets the above conditions, graph optimizations can be enabled by a simple call to the optimize_for API, as in the example below:
```python
from mxnet import np
from mxnet.gluon import nn

data = np.random.normal(size=(32, 3, 224, 224))

net = nn.HybridSequential()
net.add(nn.Conv2D(channels=64, kernel_size=(3, 3)))
net.add(nn.Activation('relu'))
net.initialize()

print("=" * 5, " Not optimized ", "=" * 5)
o = net(data)
print(o)

net.optimize_for(data, backend='ONEDNN')
print("=" * 5, " Optimized ", "=" * 5)
o = net(data)
print(o)
```
The above code snippet should produce output similar to the following (printed tensors are omitted):
```
===== Not optimized =====
[15:05:43] ../src/storage/storage.cc:202: Using Pooled (Naive) StorageManager for CPU
dnnl_verbose,info,oneDNN v2.3.2 (commit e2d45252ae9c3e91671339579e3c0f0061f81d49)
dnnl_verbose,info,cpu,runtime:OpenMP
dnnl_verbose,info,cpu,isa:Intel AVX-512 with AVX512BW, AVX512VL, and AVX512DQ extensions
dnnl_verbose,info,gpu,runtime:none
dnnl_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,8.87793
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00708008
dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,91.511
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00610352
dnnl_verbose,exec,cpu,eltwise,jit:avx512_common,forward_inference,data_f32::blocked:acdb:f0 diff_undef::undef::f0,,alg:eltwise_relu alpha:0 beta:0,32x64x222x222,85.4392
===== Optimized =====
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:Acdb64a:f0 dst_f32::blocked:abcd:f0,,,64x3x3x3,0.00610352
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:Acdb64a:f0,,,64x3x3x3,0.00585938
dnnl_verbose,exec,cpu,reorder,jit:uni,undef,src_f32::blocked:abcd:f0 dst_f32::blocked:acdb:f0,,,32x3x224x224,3.98999
dnnl_verbose,exec,cpu,convolution,brgconv:avx512_core,forward_inference,src_f32::blocked:acdb:f0 wei_f32::blocked:Acdb64a:f0 bia_f32::blocked:a:f0 dst_f32::blocked:acdb:f0,attr-post-ops:eltwise_relu:0:1 ,alg:convolution_direct,mb32_ic3oc64_ih224oh222kh3sh1dh0ph0_iw224ow222kw3sw1dw0pw0,20.46
```
After the optimization of Convolution + ReLU, oneDNN executes both operations within a single convolution primitive.
MXNet built with oneDNN brings outstanding performance improvements for quantization and INT8 inference on Intel Xeon Scalable platforms.
For questions or support specific to MKL, visit the Intel MKL website.
For questions or support specific to oneDNN, visit the oneDNN website.
If you find bugs, please open an issue on GitHub for MXNet with MKL or MXNet with oneDNN.