|author||Przemyslaw Tredak <email@example.com>||Thu Aug 20 14:14:03 2020 -0700|
|committer||GitHub <firstname.lastname@example.org>||Thu Aug 20 14:14:03 2020 -0700|
Use RTC for elementwise and broadcast ops (#18622)

* Reapplying PR #17767
* Making RTC required
* Move cuda utils to src/common/cuda and refactor RTC part
* Unary ops via RTC
* Support binary_scalar forward
  Remove elemwise_scatter_op.*
  Fix BinaryScalar usage in NumPy
* Backward of binary scalar
* Binary forward
* Fix for binary_scalar
* Moving all binary forward to RTC
  Reorganization
* Backward of binary ops
* Support broadcast
  Add RTC to NumPy ops
* RTC for elementwise sum
  Fixes
* RTC for backward usenone of broadcast
* RTC for broadcast bwd usein
* Remove non-RTC vectorization support
* Remove template from ReduceWorkspaceSize
* Fixes from rebase
* Guarding RTC usage behind MXNET_USE_CUDA
* More guards
* C++17 for CUDA code
* MixedUnaryBackwardInOut as RTC
* Removing unused variable
* Revert "C++17 for CUDA code"
  This reverts commit b09090ca4564a3e76367ffeb8ade45f521d24482.
* Get rid of CI tests without RTC
  Get rid of if constexpr as CUDA 10 does not support it
* Fix lint
* Change a few more elemwise functions
  Fix for too long value
* Fix large tensor build
* Another try with DBL_MAX
* Fix Windows compilation
* Fix the large int test
* Add the printing of error code value to CUDA_DRIVER_CALL
* Fix
* Fix binary scalar
* Get more information when cuLaunchKernel fails
* Going easy on Windows compiler
* Fix lint
* Reorganization to split strings due to Windows compilation problems
* Fix error with uninitialized value
* Fix handling of different types for backward of binary scalar
* Decreasing RTC overhead
* Fix lint and remove rest of mentions of ENABLE_RTC
* Jetson with RTC
* Fix the aws s3 command
* Debugging Windows failure
* More debugging of Windows failure
* Debug
* Fix the issue on Windows (long -> long long for 8B)
* libcuda.so for Jetson
* Enable debug information for RTC kernels and cleaning debug ptx dump
* Fix lint
* Try without linking the stub of libcuda.so to different place in Jetson
* Add docstring
* Answering review comments
* Unifying vectorization
* Fix
* Fixes for reduce ops
* Fix M=1 case
* Fixes from rebase
  Fixes for mixed type gradient functions
  Set the launch bounds on RTC kernels
* Fix
* Fix tests
* Adding tutorial for RTC
* Fixes after merge
* Fixes from review
* Change env var doc and undo the change to toctree
Apache MXNet (incubating) is a deep learning framework designed for both efficiency and flexibility. It allows you to mix symbolic and imperative programming to maximize efficiency and productivity. At its core, MXNet contains a dynamic dependency scheduler that automatically parallelizes both symbolic and imperative operations on the fly. A graph optimization layer on top of that makes symbolic execution fast and memory efficient. MXNet is portable and lightweight, scaling effectively to multiple GPUs and multiple machines.
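To make the idea of a dependency scheduler concrete, here is a minimal sketch in plain Python. It is not MXNet's engine and uses none of MXNet's APIs: `ToyScheduler` and `push` are hypothetical names invented for this illustration, and the sketch tracks only read-after-write dependencies between values, ignoring mutation. It shows how operations with no dependency on each other can run concurrently while a dependent operation waits for its inputs.

```python
from concurrent.futures import Future, ThreadPoolExecutor
import operator


class ToyScheduler:
    """Toy dataflow scheduler (illustration only, not MXNet's engine)."""

    def __init__(self, workers=4):
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def push(self, fn, *inputs):
        """Queue fn(*inputs). Each input is either a plain value or a
        Future returned by an earlier push() call."""
        def run():
            # Block until every upstream operation has produced its value.
            args = [x.result() if isinstance(x, Future) else x for x in inputs]
            return fn(*args)
        return self._pool.submit(run)

    def shutdown(self):
        self._pool.shutdown(wait=True)


sched = ToyScheduler()
a = sched.push(operator.add, 1, 2)  # independent of b
b = sched.push(operator.mul, 3, 4)  # independent of a, may run concurrently
c = sched.push(operator.add, a, b)  # runs only after both a and b finish
result = c.result()
sched.shutdown()
print(result)  # prints 15
```

The caller writes ordinary sequential-looking code, and the ordering constraints are derived from which results each operation consumes. A production scheduler would additionally have to track writes to shared buffers and avoid blocking worker threads while waiting.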
MXNet is more than a deep learning project. It is also a collection of blueprints and guidelines for building deep learning systems, and a source of insights into DL systems for hackers.
Licensed under an Apache-2.0 license.
Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.
MXNet emerged from a collaboration among the authors of cxxnet, minerva, and purine2. The project reflects what we learned from those earlier projects and combines aspects of each to achieve flexibility, speed, and memory efficiency.