Use RTC for elementwise and broadcast ops (#18622)

* Reapplying PR #17767

* Making RTC required

* Move cuda utils to src/common/cuda and refactor RTC part

* Unary ops via RTC

* Support binary_scalar forward

Remove elemwise_scatter_op.*

Fix BinaryScalar usage in NumPy

* Backward of binary scalar

* Binary forward

* Fix for binary_scalar

* Moving all binary forward to RTC


* Backward of binary ops

* Support broadcast

Add RTC to NumPy ops

* RTC for elementwise sum


* RTC for backward usenone of broadcast

* RTC for broadcast bwd usein

* Remove non-RTC vectorization support

* Remove template from ReduceWorkspaceSize

* Fixes from rebase

* Guarding RTC usage behind MXNET_USE_CUDA

* More guards

* C++17 for CUDA code

* MixedUnaryBackwardInOut as RTC

* Removing unused variable

* Revert "C++17 for CUDA code"

This reverts commit b09090ca4564a3e76367ffeb8ade45f521d24482.

* Get rid of CI tests without RTC
Get rid of if constexpr as CUDA 10 does not support it

* Fix lint

* Change a few more elemwise functions
Fix for too long value

* Fix large tensor build

* Another try with DBL_MAX

* Fix Windows compilation

* Fix the large int test

* Add the printing of error code value to CUDA_DRIVER_CALL

* Fix

* Fix binary scalar

* Get more information when cuLaunchKernel fails

* Going easy on Windows compiler

* Fix lint

* Reorganization to split strings due to Windows compilation problems

* Fix error with uninitialized value

* Fix handling of different types for backward of binary scalar

* Decreasing RTC overhead

* Fix lint and remove rest of mentions of ENABLE_RTC

* Jetson with RTC

* Fix the aws s3 command

* Debugging Windows failure

* More debugging of Windows failure

* Debug

* Fix the issue on Windows (long -> long long for 8B)

* for Jetson

* Enable debug information for RTC kernels and cleaning debug ptx dump

* Fix lint

* Try without linking the stub to a different place in Jetson

* Add docstring

* Answering review comments

* Unifying vectorization

* Fix

* Fixes for reduce ops

* Fix M=1 case

* Fixes from rebase
Fixes for mixed type gradient functions
Set the launch bounds on RTC kernels

* Fix

* Fix tests

* Adding tutorial for RTC

* Fixes after merge

* Fixes from review

* Change env var doc and undo the change to toctree
141 files changed
tree: f8243148969f71cfcfe0463ef0c5c33b6fe5ad4e
  .clang-tidy
  .codecov.yml
  .gitattributes
  .github/
  .gitignore
  .gitmodules
  .mxnet_root
  3rdparty/
  CMakeLists.txt
  KEYS
  NOTICE
  benchmark/
  cd/
  ci/
  cmake/
  config/
  contrib/
  docker/
  docs/
  example/
  include/
  plugin/
  pytest.ini
  python/
  readthedocs.yml
  snap.python
  src/
  tests/
  tools/

Apache MXNet (incubating) for Deep Learning



Apache MXNet (incubating) is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scaling effectively to multiple GPUs and multiple machines.
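The idea of a dependency scheduler that parallelizes operations can be illustrated with a toy sketch. This is not MXNet's engine or its API: the function `schedule_levels` and the op table format are hypothetical, and the sketch tracks only read-after-write dependencies. It groups operations into waves, where every operation in a wave depends only on results from earlier waves and could therefore run concurrently.

```python
from collections import defaultdict


def schedule_levels(ops):
    """Group ops into waves of mutually independent operations.

    `ops` maps an op name to (input variable names, output variable name).
    Hypothetical format used for illustration only.
    """
    # Map each variable to the op that produces it.
    producer = {out: name for name, (_, out) in ops.items()}
    level = {}

    def depth(name):
        # An op's wave is one past the deepest op it reads from.
        if name not in level:
            inputs, _ = ops[name]
            deps = [producer[v] for v in inputs if v in producer]
            level[name] = 1 + max((depth(d) for d in deps), default=-1)
        return level[name]

    waves = defaultdict(list)
    for name in ops:
        waves[depth(name)].append(name)
    return [sorted(waves[i]) for i in range(len(waves))]


# c = a + b and d = a * b are independent; e = c - d must wait for both.
ops = {
    "add": (["a", "b"], "c"),
    "mul": (["a", "b"], "d"),
    "sub": (["c", "d"], "e"),
}
print(schedule_levels(ops))
```

Here `add` and `mul` land in the same wave because neither reads the other's output, while `sub` is deferred until both have finished.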

MXNet is more than a deep learning project. It is also a collection of blueprints and guidelines for building deep learning systems, with insights into DL systems for hackers.


  • Design notes providing useful insights that can be reused by other DL projects
  • Flexible configuration for arbitrary computation graph
  • Mix and match imperative and symbolic programming to maximize flexibility and efficiency
  • Lightweight, memory efficient and portable to smart devices
  • Scales up to multiple GPUs and distributed settings with automatic parallelism
  • Support for Python, Scala, C++, Java, Clojure, R, Go, Javascript, Perl, Matlab, and Julia
  • Cloud-friendly and directly compatible with AWS S3, AWS Deep Learning AMI, AWS SageMaker, HDFS, and Azure


Licensed under an Apache-2.0 license.

Reference Paper

Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.


MXNet emerged from a collaboration by the authors of cxxnet, minerva, and purine2. The project reflects what we have learned from the past projects. MXNet combines aspects of each of these projects to achieve flexibility, speed, and memory efficiency.