This proposal introduces Meta Schedule: a scheduling DSL on TIR that unifies the approaches of AutoTVM [1] and AutoScheduler [2]. Meta schedule provides a pragmatic way to define the space of automatic tuning, extensibility in terms of all possible TIR schedule primitives like tensorization and loop partitioning, and customizability on every layer of the automation system.
Meta Schedule is the 3rd generation automatic scheduling system.
Scheduling. In TVM, the computation of each operator is described by either Tensor Expression (TE, the prior standard) or TensorIR (TIR). Through a process known as scheduling, TVM applies a pre-defined set of transformations over these IRs, either lazily (by manipulating the schedule tree) or eagerly (by manipulating the TIR directly), so that the resulting IR can be lowered to code with potentially better performance. Such transformations are guided by a developer-provided program composed from this pre-defined set of transformations, called schedule primitives.
Design space. The set of all possible schedulings of a TE/TensorIR program is called its design space. Optimization in TVM is essentially an exploration of this space to find a scheduling that transforms the IR into a kernel with optimal performance.
Currently there are 3 sets of scheduling APIs in TVM:
- Manual schedule: developers optimize their programs by manually invoking schedule primitives, i.e. exploring points in the design space by hand;
- AutoTVM: developers describe the design space by writing per-operator scheduling templates, and the system tunes the template parameters automatically;
- AutoScheduler: built-in rules generate the design space automatically, with no per-operator templates required.
The three existing scheduling systems are mutually incompatible in API design: besides manual TE scheduling, AutoTVM requires users to learn a new set of APIs, and AutoScheduler brings in yet another set of C++-based search rules. This adds to users' mental overhead in understanding and extending the systems. Further, the inability to switch between template-based and template-free auto-tuning limits customizability and can make it needlessly difficult to achieve optimal performance.
Meta schedule provides:
- succinct syntax and APIs consistent with TensorIR scheduling, so there is no separate set of APIs to learn;
- a unified way to define the design space of automatic tuning, covering manual scheduling, AutoTVM-style templates, and AutoScheduler-style rules;
- extensibility in terms of all possible TIR schedule primitives, such as tensorization and loop partitioning;
- customizability on every layer of the automation system.
Meta Schedule DSL is a language that provides TVM backend developers a flexible way to define or auto-generate the operator design space.
This section introduces the syntax of the Meta Schedule DSL by describing the 5 common usage patterns envisioned by this RFC. These patterns are:
1. Manually writing a schedule with existing schedule primitives (Section 3.1);
2. Defining composite schedules that bundle sequences of primitives (Section 3.2);
3. Describing a design space with sampling instructions (Section 3.3);
4. Generating the design space automatically with AutoScheduler-style rules (Section 3.4);
5. Mixing and unifying the styles above (Section 3.5).
Meta schedule APIs are almost the same as TE or TensorIR scheduling. Here is an example of a manual schedule for matrix multiplication:
```python
# Designate a set of tile sizes
i_tiles = [16, 8, 8, 8]
j_tiles = [16, 8, 8, 8]
k_tiles = [256, 8]
# Tile the loops according to the tile sizes
i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
k_0, k_1 = sch.split(loop=k, factors=k_tiles)
# Organize the loops into "SSRSRS" 6-level tiles
sch.reorder(
    i_0, j_0,  # S: the 1st spatial tile
    i_1, j_1,  # S: the 2nd spatial tile
    k_0,       # R: the 1st reduction tile
    i_2, j_2,  # S: the 3rd spatial tile
    k_1,       # R: the 2nd reduction tile
    i_3, j_3,  # S: the 4th spatial tile
)
```
In this manual scheduling example, developers tweak the tile sizes and measure the performance of the generated kernels to explore potential optimization opportunities.
Generally speaking, while writing a schedule, there are often some parameters that are hard to determine ahead of time, for example, tile sizes, unroll steps, or which tensor intrinsics to use. Developers may manually enumerate possible combinations of these unknown factors, and then pick the best schedule according to measurement results on their device.
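A minimal sketch of this workflow is shown below; `create_schedule`, `apply_ssrsrs_tiling`, and `build_and_measure` are hypothetical helpers standing in for schedule creation, the tiling code above, and on-device measurement:

```python
import itertools

# A minimal sketch of manual exploration: enumerate candidate tile sizes,
# schedule the workload once per combination, and keep the fastest variant.
best_time, best_tiles = float("inf"), None
for i_tiles, j_tiles in itertools.product(
    [[16, 8, 8, 8], [32, 4, 8, 8]],  # candidate tilings for the i axis
    [[16, 8, 8, 8], [64, 2, 8, 8]],  # candidate tilings for the j axis
):
    sch = create_schedule()  # hypothetical: fresh schedule per candidate
    apply_ssrsrs_tiling(sch, i_tiles, j_tiles, k_tiles=[256, 8])  # the code above
    time = build_and_measure(sch)  # hypothetical: compile and time on device
    if time < best_time:
        best_time, best_tiles = time, (i_tiles, j_tiles)
```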
As introduced in the previous section, each schedule primitive in TensorIR handles only a very basic transformation of the IR. For example, `split` only splits a loop into two new loops. In the real world, the fine granularity of those primitives usually leads to repetitive and verbose scheduling code. Take the code snippet in the previous section as an example: a sequence of `split`s is invoked, followed by a `reorder`; taken together, these 4 primitives are colloquially known as "SSRSRS" tiling.
To make it more convenient and modular, users are allowed to register composite schedules that apply a sequence of schedule primitives according to certain analysis of the IR. The word composite here means the schedule transformation is composed of those primitives.
For example, suppose there is a composite schedule called `InlineElementwiseOperation`, which inlines elementwise computations into their consumers whenever possible. Applying it to the following TensorIR:
```python
@tvm.script.tir
def example_func(...):
    for i, j in ...:
        with tir.Block("B") ...:
            B[i, j] = A[i, j] + 1
    for i, j in ...:
        with tir.Block("C") ...:
            C[i, j] = B[i, j] + 1
    for i, j in ...:
        with tir.Block("D") ...:
            D[i, j] = C[i, j] + 1

sch = tir.Schedule(example_func)
# `InlineElementwiseOperation` is a composite schedule rule that analyzes a given block.
# If the block contains only elementwise computation, and can be inlined into its consumer,
# then `sch.compute_inline` is called on that block.
inliner = InlineElementwiseOperation()
inliner.apply(schedule=sch, block=sch.get_block("B"))
inliner.apply(schedule=sch, block=sch.get_block("C"))
inliner.apply(schedule=sch, block=sch.get_block("D"))
```
Below is the result after applying this composite schedule, and its corresponding trace:
```python
>>> print(tvm.script.asscript(sch.mod))
@tvm.script.tir
def example_func(...):
    for i, j in ...:
        with tir.Block("D") ...:
            D[i, j] = A[i, j] + 1 + 1 + 1

>>> print(sch.trace)
# Block "B" is elementwise and inlinable, so `sch.compute_inline(B)` is called
B = sch.get_block("B")
sch.compute_inline(B)
# Block "C" is elementwise and inlinable, so `sch.compute_inline(C)` is called
C = sch.get_block("C")
sch.compute_inline(C)
# Block "D" is elementwise but does not have a consumer,
# so the rule does not call `compute_inline` because it is not inlinable
D = sch.get_block("D")
```
Meta schedule extends the schedule DSL with a set of new schedule primitives called sampling instructions. These primitives do not transform the TensorIR; instead, they introduce random variables that can be referenced later in scheduling to parameterize the schedule. Incorporating sampling instructions into an operator's schedule allows the backend developers to succinctly describe a design space in terms of tiling strategies, fusion levels, unroll lengths, etc.
The matmul example above is extended to cover all possible tilings using these sampling instructions:
```python
# Sample tile sizes
i_tiles = sch.sample_perfect_tile(i, n=4)  # was: i_tiles = [16, 8, 8, 8]
j_tiles = sch.sample_perfect_tile(j, n=4)  # was: j_tiles = [16, 8, 8, 8]
k_tiles = sch.sample_perfect_tile(k, n=2)  # was: k_tiles = [256, 8]
# Tile the loops according to the random variables
i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
k_0, k_1 = sch.split(loop=k, factors=k_tiles)
# Organize the loops into "SSRSRS" 6-level tiles
sch.reorder(
    i_0, j_0,  # S: the 1st spatial tile
    i_1, j_1,  # S: the 2nd spatial tile
    k_0,       # R: the 1st reduction tile
    i_2, j_2,  # S: the 3rd spatial tile
    k_1,       # R: the 2nd reduction tile
    i_3, j_3,  # S: the 4th spatial tile
)
```
To generate a design space, AutoScheduler applies a set of rules to each TE stage that corresponds to a TE operation defined by `te.compute(...)`. The rules analyze the TE operations and apply an internal DSL to manipulate their internal IR, which is in the end mapped to TE schedule primitives. This process is called sketch generation.
Composite schedule rules, as introduced in Section 3.2, work in a similar way when scheduling TensorIR: they analyze the TensorIR and apply schedule primitives directly to it accordingly. Applying such rules to each TensorIR block in a certain order (post-DFS is the built-in order, but customization is allowed) generates a sequence of schedule primitives; this corresponds to the sketch generation phase in AutoScheduler. If sampling instructions are present in this sequence, they define a design space for the meta schedule to explore, which is similar to the random annotation phase in AutoScheduler.
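As a rough mental model (not the actual implementation), applying rules in post-DFS order while allowing forks can be sketched as a worklist over candidate schedules; `post_dfs_block_names` is an assumed helper:

```python
def post_order_apply(sch, rules):
    # Hypothetical sketch of sketch generation: visit blocks in post-DFS
    # order and let every rule rewrite, and possibly fork, each candidate.
    schedules = [sch]
    for name in post_dfs_block_names(sch):  # assumed traversal helper
        for rule in rules:
            next_schedules = []
            for s in schedules:
                result = rule.apply(s, s.get_block(name))
                # A rule returns either one schedule or a list of forks
                next_schedules.extend(result if isinstance(result, list) else [result])
            schedules = next_schedules
    return schedules  # each schedule carries a trace defining part of the space
```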
Several built-in composite schedule rules, such as multi-level tiling, are shipped with meta schedule to align with the design space generated by AutoScheduler.
Developers may implement their own rules in either Python or C++. They may specify which rules to use with the following syntax:
```python
from tvm import meta_schedule as ms

design_space_generator = ms.PostOrderApply(rules=[
    ms.MultiLevelTiling(...),
    CustomRule(...),
    ms.OtherBuiltinRules(...),
])
```
This subsection shows that the design spaces induced by manual TE scheduling, AutoTVM, and AutoScheduler are all subsets of the meta schedule design space, and that meta schedule further allows mixing those three styles so that they can be searched jointly.
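For instance, one schedule function can mix a hand-picked, AutoTVM-style tile configuration with AutoScheduler-style sampled tiles. The sketch below assumes the same matmul workload as above; the function name is illustrative:

```python
def schedule_matmul_mixed(sch):
    i, j, k = sch.get_loops(sch.get_block("matmul"))
    # Manual / AutoTVM style: the reduction tiling is fixed by hand,
    # encoding the developer's domain knowledge about this workload
    k_0, k_1 = sch.split(loop=k, factors=[256, 8])
    # AutoScheduler style: the spatial tilings are sampled automatically,
    # leaving the decision to the exploration strategy
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=sch.sample_perfect_tile(i, n=4))
    j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=sch.sample_perfect_tile(j, n=4))
    # The same "SSRSRS" structure as before
    sch.reorder(i_0, j_0, i_1, j_1, k_0, i_2, j_2, k_1, i_3, j_3)
    return sch
```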
This section introduces the underlying techniques for the automation system to extract and explore the design space. The figure below briefly illustrates the workflow of the system:
Trace. To represent the design space defined by the meta schedule DSL, the underlying system records all the instructions users apply to the schedule class, including both sampling instructions and schedule primitives. This list of invoked scheduling instructions, along with the random decisions made by the sampling instructions, is called a trace.
For instance, executing the example above results in the following trace:
```
Instruction 0. Sample-Perfect-Tile
Instruction 1. Sample-Perfect-Tile
Instruction 2. Sample-Perfect-Tile
Instruction 3. Split
Instruction 4. Split
Instruction 5. Split
Instruction 6. Reorder
```
Trace forms the design space. A trace may contain zero or more sampling instructions, which introduce uncertainty into scheduling: one instance of sampling yields one point in the design space. Therefore, the trace itself forms a design space to explore, e.g. which set of tile sizes works best on specific hardware.
Union of design space. Meta schedule works on a set of traces, representing the union of the design spaces represented by every single trace.
Fork a trace. When two different decisions in the scheduling process are equally important for generating high-performance schedules, the trace is allowed to fork into two, and the design space is the union of the forked traces.
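Since a composite schedule may return a list of schedules (see the `apply` signature of composite schedules shown later in this RFC), forking can be sketched as a rule that returns one schedule per branch. The rule below is illustrative, not a built-in, and the primitive arguments shown are assumptions:

```python
# An illustrative rule that forks the trace: when it is unclear whether
# staging an input in shared memory helps, keep both branches.
def fork_on_cache_read(sch, block):
    forked = sch.copy()  # duplicates the schedule along with its trace
    forked.cache_read(block, 0, "shared")  # branch 1: stage the first input
    return [sch, forked]                   # branch 2: leave the block as-is
```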
The trace is not strictly user-facing, but can be accessed and printed with the following syntax:
```python
# Requires tracing the execution
sch = tir.Schedule(..., traced=True)
# Do a lot of scheduling
...
# Print the trace
print(sch.trace)
```
Below is an example of the printed trace, which faithfully reflects the schedule as a snippet of a Python scheduling function:
```python
b0 = sch.get_block(name="matmul", func_name="main")
l1, l2, l3 = sch.get_loops(block=b0)
v4, v5, v6, v7 = sch.sample_perfect_tile(loop=l1, n=4, max_innermost_factor=16, decision=[32, 1, 16, 2])
v8, v9, v10, v11 = sch.sample_perfect_tile(loop=l2, n=4, max_innermost_factor=16, decision=[64, 4, 2, 2])
v12, v13 = sch.sample_perfect_tile(loop=l3, n=2, max_innermost_factor=16, decision=[64, 16])
l14, l15, l16, l17 = sch.split(loop=l1, factors=[v4, v5, v6, v7])
l18, l19, l20, l21 = sch.split(loop=l2, factors=[v8, v9, v10, v11])
l22, l23 = sch.split(loop=l3, factors=[v12, v13])
sch.reorder(l14, l18, l15, l19, l22, l16, l20, l23, l17, l21)
```
Implementation. A trace is defined as:
```python
class Trace:
    instructions: List[Instruction]
    decisions: Dict[Instruction, Any]
```
For each sampling instruction in the trace, if it has a corresponding entry in the decisions dict, then its output is uniquely determined by that decision, so reproducibility is guaranteed (Example 1); if a corresponding entry is not present, then randomness is introduced when interpreting the trace (Example 2).
```python
# Example 1. Trace with deterministic result
l1, l2 = sch.sample_perfect_tile(loop, n=2, decision=[4, 32])
# Deterministic: l1 = 4, l2 = 32

# Example 2. Trace with randomized result
l1, l2 = sch.sample_perfect_tile(loop, n=2)
# l1 and l2 are random
```
Meta Schedule provides several built-in exploration strategies to exhaustively or efficiently search for efficient schedules. Most of these strategies work by re-executing either a function or a trace with a built-in interpreter in meta schedule; this process is called replay.
With a user-provided schedule function as a black-box design space generator, meta schedule repetitively invokes this opaque TVM packed function without doing any extra analysis. If sampling instructions are present, then scheduling is non-deterministic (random decisions may not be repeated across runs). Effectively, this is equivalent to random exploration without a trace: it gives users the flexibility to define arbitrary functions that a trace may not support well (e.g. control flow that diverges based on the value of intermediate random variables), but it forecloses any future trace-based analysis.
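As an illustration, the hypothetical schedule function below branches on the concrete value of an intermediate random variable; such value-dependent control flow is easy to write as a black-box function but is not directly representable as a linear trace:

```python
def schedule_with_divergence(sch):
    # Hypothetical example of value-dependent control flow in a black-box
    # schedule function. `sch.get` is assumed to read the sampled value.
    i, j, k = sch.get_loops(sch.get_block("matmul"))
    i_tiles = sch.sample_perfect_tile(i, n=4)
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
    if sch.get(i_tiles[3]) >= 8:  # branch on an intermediate random variable
        sch.vectorize(i_3)
    else:
        sch.unroll(i_3)
    return sch
```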
A built-in interpreter directly replays the traces obtained from manual scheduling, template-based, or template-free design space generation. If sampling instructions are present in the traces, their random decisions are mutated during each replay, i.e. each replay jumps to a new point in the design space. Therefore, repetitive replay of those traces is equivalent to exploration of the design space.
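Conceptually, the interpreter can be pictured as follows; `execute` is a hypothetical dispatch helper, not the actual implementation:

```python
def replay(trace: Trace, mod) -> "tir.Schedule":
    # Hypothetical sketch of the built-in interpreter: re-run each recorded
    # instruction on a fresh schedule. A stored decision is reused when
    # present; otherwise (or when mutating) a fresh random one is drawn.
    sch = tir.Schedule(mod, traced=True)
    for inst in trace.instructions:
        decision = trace.decisions.get(inst)  # None => sample a new decision
        execute(sch, inst, decision)          # assumed dispatch helper
    return sch
```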
The search speed of meta schedule can be improved by analyzing traces before they are run. For example, trace analysis can reject obviously invalid schedules (e.g. those using too many CUDA resources) and remove dead code before execution. The cost model can benefit from traces as well, by extracting trace-level features.
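As a sketch of such trace-level analysis (the `kind` attribute and the mapping from tile factors to threads are assumptions for illustration), a filter might bound the sampled thread count before ever building the kernel:

```python
MAX_THREADS_PER_BLOCK = 1024  # illustrative CUDA limit

def is_probably_valid(trace: Trace) -> bool:
    # Hypothetical sketch: reject a trace whose sampled tile sizes already
    # exceed the CUDA thread limit, without building or running anything.
    threads = 1
    for inst in trace.instructions:
        if inst.kind == "SamplePerfectTile":  # attribute name assumed
            decision = trace.decisions.get(inst)
            if decision is not None:
                threads *= decision[1]  # assume the 2nd factor becomes threadIdx
    return threads <= MAX_THREADS_PER_BLOCK
```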
Evolutionary search is a more efficient exploration strategy introduced by AutoScheduler. For details, please refer to Section 5.1 of its paper [2].
The evolutionary search strategy requires two sets of rule-specific logic, executed to either tweak or validate the produced schedule: mutators, which define how to jump to a neighboring point in the design space, and postprocessors, which validate or fix a schedule after its trace is executed.
The evolutionary search algorithm uses mutators to find possible schedules in the design space, then applies postprocessors and asks the cost model to predict their performance. After a few iterations, the new schedules with the highest scores are finally compiled and measured on device. Epsilon-greedy is used in this process to balance exploitation and exploration.
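Roughly, one generation of the search can be sketched as below; the interfaces `mutator.apply`, `postprocessor.apply`, and `cost_model.predict` are assumptions standing in for the rule-specific logic described above:

```python
import random

def evolve_one_generation(population, mutators, postprocessors, cost_model,
                          num_survivors, epsilon=0.05):
    # Rough sketch of one evolutionary step with epsilon-greedy selection.
    candidates = []
    for trace in population:
        mutator = random.choice(mutators)  # jump to a neighbor in the space
        new_trace = mutator.apply(trace)
        if all(p.apply(new_trace) for p in postprocessors):  # validate/fix
            candidates.append(new_trace)
    # Exploit: rank candidates by predicted score (higher is better) ...
    ranked = sorted(candidates, key=cost_model.predict, reverse=True)
    n_greedy = int(num_survivors * (1 - epsilon))
    survivors = ranked[:n_greedy]
    # ... and explore: fill the remaining slots with random candidates
    rest = ranked[n_greedy:]
    survivors += random.sample(rest, min(num_survivors - n_greedy, len(rest)))
    return survivors
```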
All the measure records are serialized and stored in a database. Each record in the database carries at least the workload, the hardware target, the trace that produced the schedule, and the measured running times.
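A plausible shape for a single record, sketched as a dataclass whose field names are assumptions rather than the exact schema:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MeasureRecord:
    workload: str          # identifies the tuning task, e.g. the TensorIR function
    target: str            # hardware target the kernel was built for
    trace: "Trace"         # the trace that produced the measured schedule
    run_secs: List[float]  # measured running times on the device
```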
The system is implemented in a way that all levels are decoupled and open to customization, aiming to provide a playground for developers to try out new ideas and potentially deliver performance quickly.
While all the important APIs are implemented in C++ for efficiency, every part of the system can easily be switched to a customized Python implementation. For example:
The design space can be defined as a Python function:
```python
def schedule_matmul(sch: Schedule) -> Schedule:
    i, j, k = sch.get_loops(sch.get_block("matmul"))
    # Sample tile sizes as random variables
    i_tiles = sch.sample_perfect_tile(i, n=4)
    j_tiles = sch.sample_perfect_tile(j, n=4)
    k_tiles = sch.sample_perfect_tile(k, n=2)
    # Tile the loops according to the random variables
    i_0, i_1, i_2, i_3 = sch.split(loop=i, factors=i_tiles)
    j_0, j_1, j_2, j_3 = sch.split(loop=j, factors=j_tiles)
    k_0, k_1 = sch.split(loop=k, factors=k_tiles)
    # Organize the loops into "SSRSRS" 6-level tiles
    sch.reorder(
        i_0, j_0,  # S
        i_1, j_1,  # S
        k_0,       # R
        i_2, j_2,  # S
        k_1,       # R
        i_3, j_3,  # S
    )
    return sch
```
The system provides two interfaces to define a composite schedule in python, one is more succinct, and the other is more comprehensive:
Method 1. Derive from `PyCompositeSchedule`, and implement two methods, `initialize` and `apply`:
```python
class MultiLevelTiling(PyCompositeSchedule):
    def initialize(...):
        # Initialize the class; usually this method is empty
        ...

    def apply(sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
        # Write any Python code here, including:
        # - analyzing `block`
        # - invoking schedule primitives on `sch`
        # - debug printing
        ...
```
Method 2. Use a decorator as syntactic sugar when the `initialize` method is empty; it converts the decorated function into the `apply` method:
```python
@tir.as_composite_schedule(name="multi-level-tiling")
def multi_level_tiling(sch: Schedule, block: BlockRV) -> Union[Schedule, List[Schedule]]:
    # Write any Python code here, including:
    # - analyzing `block`
    # - invoking schedule primitives on `sch`
    # - debug printing
    ...
```
Developers can implement any search algorithm in Python as well by deriving from `PySearchPolicy`; the syntax is identical to customizing with `PyCompositeSchedule`.
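For instance, a trivial random-replay policy might look like the sketch below; the method name `search`, its arguments, and the `replay` helper are all assumptions about the interface:

```python
import random

class RandomReplayPolicy(PySearchPolicy):
    # Illustrative sketch only: a policy that ignores measurement feedback
    # and simply replays the design space with fresh random decisions.
    def search(self, traces, num_trials, builder, runner):
        best_sch, best_time = None, float("inf")
        for _ in range(num_trials):
            sch = replay(random.choice(traces), self.mod)  # hypothetical helper
            time = runner.run(builder.build(sch))          # assumed interfaces
            if time < best_time:
                best_sch, best_time = sch, time
        return best_sch
```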
Other components of the system are customizable in the same way, including the cost model, the database, and the measurement callbacks.
In short, almost every component of the system is decoupled from the others, and extensions can be easily plugged in.
Meta schedule requires integration with Relay operator strategy and compile engine (TECompiler) to properly lower a Relay subgraph for task extraction. Further, TE schedules in TOPI will need to be migrated to TensorIR schedules.
The system is designed with the principle of minimalism: assuming users already know the TensorIR scheduling APIs, there is no extra API set to learn; previous programs that schedule TensorIR still work out of the box with the meta schedule DSL. This could lower the bar for adopting the system.
Unifying manual scheduling, AutoTVM's semi-automatic templates, and AutoScheduler's fully automatic sketch generation provides a flexible way to balance the injection of new domain knowledge against automation.
Flexibility in customization allows quick experimentation with new tasks, new strategies, and new hardware targets without deep knowledge of the system.
Tensor Expression (TE) in TVM is a DSL that decouples compute and schedule, which provides convenient ways to handcraft optimized kernels for different hardware targets.
TensorIR is the low-level IR in TVM. Its capability of eagerly applying schedule primitives opens the door for meta schedule, the proposed next-generation auto scheduling system.
AutoTVM is the 1st generation automation framework in TVM; it requires developers to implement per-operator scheduling templates, and the system handles the tuning process.
AutoScheduler is the 2nd generation automation framework in TVM, whose built-in rules can automatically generate schedule templates for almost all operators on CPU, GPU, etc.
Control Flow
The meta schedule DSL does not support control flow yet. Although no real-world use case has been reported at the time of writing, control flow could appear in future workloads. The best syntax for it is not yet determined, but a working example could resemble TensorFlow's `tf.cond`.
Assertion
Sampling instructions may lead to invalid schedules on CUDA, e.g. the resulting program may use too much shared memory or too many threads. Such cases are currently detected by a postprocessor. To accelerate the process, it is possible to introduce an assertion statement that exits early if the generated GPU code is invalid; its syntax could be, for example:
```python
sch.assert(j_2 * j_2 <= 1024)
```
Unifying Manual Scheduling, AutoTVM and AutoScheduler in TOPI
As described in Section 3.5, meta schedule provides an idiomatic approach to unifying the three existing scheduling APIs in TVM.
At the time of writing, TOPI contains a number of schedule functions implemented either in manual TE or AutoTVM-style. It is our future work to unify these existing scheduling APIs on TOPI operators, and enable different styles to be auto-tuned jointly.
[1] Chen, Tianqi, et al. "Learning to Optimize Tensor Programs." Advances in Neural Information Processing Systems 31 (2018): 3389-3400.

[2] Zheng, Lianmin, et al. "Ansor: Generating High-Performance Tensor Programs for Deep Learning." 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20). 2020.