- Feature Name: DietCode: An Auto-Scheduler for Dynamic Tensor Programs
- Start Date: 2022-05-10
- RFC PR: [apache/tvm-rfcs#xx](https://github.com/apache/tvm-rfcs/pull/xx)
- GitHub Issue: [apache/tvm#yy](https://github.com/apache/tvm/issues/yy)

# Summary
[summary]: #summary

We propose to integrate DietCode, an auto-scheduler for dynamic tensor programs,
into AutoTIR. DietCode offers the following features:
- A shape-generic search space that covers all possible shapes in dynamic-shape
  workloads.
- A dynamic-shape-aware cost model to judge the quality of schedule candidates.
- Enhancements to the TVM CUDA codegen to support imperfect tiling.

DietCode was published at MLSys 2022, so please see [the
paper](https://proceedings.mlsys.org/paper/2022/hash/fa7cdfad1a5aaf8370ebeda47a1ff1c3-Abstract.html)
for more details and evaluation results. Meanwhile, the latest DietCode codebase
is publicly available [here](https://github.com/UofT-EcoSystem/DietCode).

# Motivation
[motivation]: #motivation

Achieving high performance for compute-intensive operators in machine learning
workloads is a crucial yet challenging task. Many machine learning and system
practitioners rely on vendor libraries or auto-schedulers to do the job. While
the former requires significant engineering effort, the latter currently
supports only static-shape workloads in TVM. It is difficult, if not
impractical, to apply the existing auto-scheduler directly to **dynamic-shape
workloads**, as doing so leads to extremely long tuning times.

We observe that the key challenge existing auto-schedulers face when handling a
dynamic-shape workload is that they cannot construct a search space that covers
all possible shapes of the workload, because their search spaces are
shape-dependent. To address this, this RFC adds dynamic-shape support to AutoTIR
by integrating the DietCode framework, which constructs **a shape-generic search
space and cost model** to auto-schedule dynamic-shape workloads efficiently.

Our evaluation shows that DietCode has the following key strengths when
auto-scheduling an entire model end-to-end:

1. It reduces auto-scheduling time by up to 5.88x compared with the current
   auto-scheduler on 8 uniformly sampled dynamic shapes, and
1. it improves performance by up to 69.5% over the current auto-scheduler and by
   up to 18.6% over the vendor library.

These advantages make DietCode an efficient and practical solution for
dynamic-shape workloads.

# Guide-Level Explanation
[guide-level-explanation]: #guide-level-explanation

The existing experiments were largely conducted with the auto-scheduler.
However, after syncing with the AutoTIR team over several quarters, we plan to
integrate this RFC into MetaSchedule (AutoTIR), because it provides a more
systematic interface and a cleaner integration path with fewer hacks.

The following example shows the additional information users are required to
feed to the system (see https://github.com/UofT-EcoSystem/DietCode/tree/MLSys2022_AE
for a PoC design):

```python
# Symbolic shape variables
T = tir.ShapeVar('T')
I = tir.ShapeVar('I')
H = tir.ShapeVar('H')
# The candidate values of `T`
T_vals = range(1, 128)
# The workload instances, i.e., the static shapes that `(T, I, H)` can take
wkl_insts = []
for t in T_vals:
    wkl_insts.append((t, 768, 768))
    wkl_insts.append((t, 768, 3072))
    wkl_insts.append((t, 3072, 768))

task = Task(func=Dense,
            args=(16 * T, I, H),
            shape_vars=(T, I, H),
            wkl_insts=wkl_insts,
            wkl_inst_weights=([1. for _ in T_vals],))
```

To enable auto-scheduling for dynamic-shape workloads, users only need to:
1. Have `ShapeVar`s in the TE/TensorIR computation.
2. Specify the weight/distribution of each shape value.

Notes:
1. In Relay, the symbolic shape constraint has to be supplied as an additional
   input; it could be inferred automatically once Relax is introduced.
2. The proposed interface does not change any existing functionality.
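
For concreteness, the snippet below sketches what such a `Dense` computation
with a symbolic batch dimension can look like in TE today. It uses the existing
`te.var` in place of the proposed `tir.ShapeVar`, so the exact spelling is
illustrative rather than final:

```python
import tvm
from tvm import te


def dense(T, I, H):
    """Dense (matmul) whose first dimension `16 * T` is symbolic."""
    X = te.placeholder((16 * T, I), name='X')
    W = te.placeholder((H, I), name='W')
    k = te.reduce_axis((0, I), name='k')
    Y = te.compute((16 * T, H),
                   lambda i, j: te.sum(X[i, k] * W[j, k], axis=k),
                   name='Y')
    return [X, W, Y]


# `T` is symbolic; `I` and `H` are fixed by each workload instance.
T = te.var('T')
X, W, Y = dense(T, 768, 768)
```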

# Reference-Level Explanation
[reference-level-explanation]: #reference-level-explanation

Here is an overview of the DietCode framework design:

<img src="https://raw.githubusercontent.com/UofT-EcoSystem/DietCode/main/docs/figures/DietCode.jpg" width="61.8%" />

- We accept the shape variables and the workload instances from the programmer.
  In case they are not detected, the auto-scheduler treats the workload as
  static and applies the current workflow to it.
- We construct **a shape-generic search space that consists of micro-kernels**,
  where a micro-kernel is an incomplete program that carries out a tile of the
  complete computation; this is what allows us to support dynamic-shape
  workloads efficiently.

  We use the hardware constraints (e.g., the maximum number of threads, the
  amount of shared and local memory) rather than the shape information to
  determine the micro-kernel candidates. Those candidates serve as the building
  blocks and are executed repeatedly to carry out a workload instance (defined
  as a static-shape instance of the dynamic-shape workload).
- We build a **micro-kernel-based cost model**, sketched in code after this
  list. The key insight is that the cost of a complete program *P* that is made
  up of a micro-kernel *M* can be decomposed into two parts:

  1. A shape-generic cost function *f*<sub>MK</sub> that predicts the cost of
     *M*, and
  1. A shape-dependent adaptation cost function *f*<sub>adapt</sub> that defines
     the penalty of porting *M* to *P*.

  While *f*<sub>MK</sub> is a function that has to be learned and updated by
  real hardware measurements during the auto-scheduling process,
  *f*<sub>adapt</sub> is a simple analytical term that can be evaluated from the
  core occupancy and the padding ratio (in other words, it does not require
  feature extraction from the schedules).
- We generate one kernel per workload instance and use the scikit-learn
  framework to train a decision-tree dispatcher that maps each workload instance
  to its corresponding kernel (see the dispatcher sketch after this list). The
  decision tree is emitted in predicate-only form for efficient runtime
  dispatching and embedded as part of the host code. As an example, one possible
  auto-scheduling outcome can look like the following:

```C++
__global__ void default_function0(float* X, float* W, float* Y) {...}
__global__ void default_function1(float* X, float* W, float* Y) {...}
__global__ void default_function2(float* X, float* W, float* Y) {...}

// host code
if (T < 16)
  call(default_function0)
else if (T < 64)
  call(default_function1)
else
  call(default_function2)
```

  Because everything can be included in a single `PackedFunc` object, the
  workflow is fully compatible with the Relay workflow.
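
To make the cost decomposition above concrete, here is a minimal Python sketch
of how a cost query could be assembled. All names (`mk.tile`,
`mk.threads_per_block`, `hw.max_resident_threads`, `predict_mk_cost`) are
hypothetical, and the functional form of *f*<sub>adapt</sub> is deliberately
simplified for illustration:

```python
import math


def predict_program_cost(mk, inst, hw, predict_mk_cost):
    """Predict the cost of running workload instance `inst` with micro-kernel `mk`."""
    # f_MK: learned, shape-generic cost of one micro-kernel invocation.
    mk_cost = predict_mk_cost(mk)
    # Number of micro-kernel invocations needed to tile the instance
    # (imperfect tiles are rounded up, i.e., padded to full micro-kernel size).
    num_blocks = math.prod(math.ceil(d / t) for d, t in zip(inst.shape, mk.tile))
    # f_adapt: analytical penalties that need no feature extraction.
    padded_size = math.prod(math.ceil(d / t) * t
                            for d, t in zip(inst.shape, mk.tile))
    padding_penalty = padded_size / math.prod(inst.shape)  # wasted, padded work
    occupancy = min(1.0, num_blocks * mk.threads_per_block
                    / hw.max_resident_threads)
    return mk_cost * num_blocks * padding_penalty / occupancy
```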
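The decision-tree dispatcher can be prototyped directly with scikit-learn. The
sketch below trains a tree on hypothetical tuning records (a `T` value mapped to
the id of its best kernel) and lowers it to predicate-only host code; the data
and the C-like output format are illustrative:

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical tuning records: for each value of `T`, the id of the fastest
# tuned kernel (in practice these come from hardware measurements).
shapes = [[t] for t in range(1, 128)]
best_kernel = [0 if t < 16 else 1 if t < 64 else 2 for t in range(1, 128)]

clf = DecisionTreeClassifier(max_depth=4).fit(shapes, best_kernel)


def emit_dispatch(tree, node=0, indent=''):
    """Recursively print the tree as if/else predicates over `T`."""
    if tree.children_left[node] == -1:  # leaf: dispatch to a kernel
        kernel_id = tree.value[node].argmax()
        print(f'{indent}call(default_function{kernel_id})')
        return
    print(f'{indent}if (T <= {tree.threshold[node]:.0f}) {{')
    emit_dispatch(tree, tree.children_left[node], indent + '  ')
    print(f'{indent}}} else {{')
    emit_dispatch(tree, tree.children_right[node], indent + '  ')
    print(f'{indent}}}')


emit_dispatch(clf.tree_)
```
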
# Drawbacks
[drawbacks]: #drawbacks

- The current compilation workflow generates one program per workload instance.
  We could merge those static-shape programs into a single dynamic-shape program
  such as the following:

```CUDA
__global__ void default_function(float* X, float* W, float* Y,
                                 const int T) {...}
// Note that the shape variable `T` is now a kernel parameter.
```

  However, our evaluations indicate that such a program has at least 5% worse
  performance than its static-shape alternatives. Hence, we decide to sacrifice
  binary size for runtime performance, which can potentially be problematic when
  hardware resources are limited.

# Rationale and Alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

[Nimble](https://arxiv.org/pdf/2006.03031.pdf) proposes an alternative approach
that partitions the range of a dynamic shape into buckets and tunes one kernel
per bucket (sketched below). We could, of course, implement this approach in the
current auto-scheduler and AutoTIR. However, as evaluated in the DietCode paper,
this approach is not guaranteed to achieve performance comparable to that of
static shapes.
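
For reference, a minimal sketch of the bucketing idea, with illustrative
power-of-two bucket boundaries (the real Nimble policy may differ):

```python
def bucket_of(t, boundaries=(16, 32, 64, 128)):
    """Return the bucket boundary that `t` is padded up to."""
    for b in boundaries:
        if t <= b:
            return b
    return boundaries[-1]


# At runtime the input is padded up to its bucket boundary and dispatched to
# the kernel tuned for that bucket; the padding is wasted computation.
assert bucket_of(20) == 32
```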

# Prior State-of-the-Art
[prior-sotas]: #prior-sotas

- **Reuse-Based Tuners**

  Selective Tuning ([Cody Yu,
  2019](https://github.com/apache/incubator-tvm/issues/4188)) and ETO ([Jingzhi
  Fang et al., VLDB 2021](http://www.vldb.org/pvldb/vol15/p183-chen.pdf)) group
  workloads into clusters based on a set of pre-defined rules (e.g., the
  similarity ratio in Selective Tuning) and reuse the same schedule within each
  cluster.

- **Dynamic Neural Networks**

  Dynamic batching is a common graph-level optimization adopted by frameworks
  such as DyNet ([Graham Neubig et al., 2017](http://arxiv.org/abs/1701.03980)),
  Cavs ([Shizhen Xu et al., USENIX ATC
  2018](https://www.usenix.org/conference/atc18/presentation/xu-shizen)),
  BatchMaker ([Pin Gao et al., EuroSys
  2018](https://doi.org/10.1145/3190508.3190541)), and TensorFlow Fold ([Moshe
  Looks et al., ICLR 2017](https://openreview.net/forum?id=ryrGawqex)) for cases
  when the batch size is dynamic.

  Nimble ([Haichen Shen et al., MLSys
  2021](https://proceedings.mlsys.org/paper/2021/hash/4e732ced3463d06de0ca9a15b6153677-Abstract.html))
  and DISC ([Kai Zhu et al., EuroMLSys
  2021](https://dl.acm.org/doi/10.1145/3437984.3458838)) both design a compiler
  to represent and execute dynamic neural networks.

  Cortex ([Pratik Fegade et al., MLSys
  2021](https://proceedings.mlsys.org/paper/2021/hash/182be0c5cdcd5072bb1864cdee4d3d6e-Abstract.html))
  is a compiler-based framework for recursive neural networks.

  These works focus on graph-level optimizations and are therefore orthogonal
  to DietCode, which operates on each individual layer. In fact, those
  graph-level solutions can also leverage DietCode for efficient operator code
  generation.

# Unresolved Questions
[unresolved-questions]: #unresolved-questions

- The current design does not support arbitrary shape dimensions: for good
  auto-scheduling outcomes, we expect the candidate values of each shape
  dimension to be specified beforehand.
- The proposed approach has mostly been exercised on NVIDIA GPUs and has not
  been tested on other hardware platforms.

# Future Possibilities
[future-possibilities]: #future-possibilities

- Evaluate more operator use cases.
- CPU support.

# Upstream Milestones
[upstream-milestones]: #upstream-milestones

We propose the following milestones for upstreaming, where each bullet point
corresponds to a PR with unit tests of roughly several hundred lines:

- [ ] Code Generation Support
  - Local Padding
  - Loop Partitioning
- [ ] Meta-Scheduler
  - Frontend Interface
  - Sketch Generation
  - Random Annotations
  - Program Measurer
  - Micro-Kernel Cost Model
  - Evolutionary Search
- [ ] Decision-Tree Dispatching

When testing, we will follow the same testing procedure as the meta-scheduler
and do not require any extra hardware platforms. Our plan is to take
dynamic-shape workloads (e.g., dense from BERT and conv2d from ResNet-50) and
compare their performance numbers with those delivered by the meta-scheduler on
the corresponding static-shape workloads. The performance difference is expected
to be smaller than 5%.