| .. Licensed to the Apache Software Foundation (ASF) under one |
| or more contributor license agreements. See the NOTICE file |
| distributed with this work for additional information |
| regarding copyright ownership. The ASF licenses this file |
| to you under the Apache License, Version 2.0 (the |
| "License"); you may not use this file except in compliance |
| with the License. You may obtain a copy of the License at |
| |
| .. http://www.apache.org/licenses/LICENSE-2.0 |
| |
| .. Unless required by applicable law or agreed to in writing, |
| software distributed under the License is distributed on an |
| "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY |
| KIND, either express or implied. See the License for the |
| specific language governing permissions and limitations |
| under the License. |
| |
| Design and Architecture |
| ======================= |
| |
| This document is intended for developers who want to understand the architecture of Apache TVM and/or actively develop on the project. |
| This page is organized as follows: |
| |
| - The `Overall Flow`_ gives an overview of the steps that TVM takes to turn a high-level description of a model into a deployable module. |
| To get started, please read this section first. |
| - Brief introduction to the key components of the TVM stack. Feel free to also check out the :ref:`TensorIR Deep Dive <tensor-ir-deep-dive>` |
| and :ref:`Relax Deep Dive <relax-deep-dive>` for more details about the two major components in the TVM stack. |
| |
| This guide provides a few complementary views of the architecture. |
| First, we review a single end-to-end compilation flow and discuss the key data structures and the transformations. |
| This runtime-based view focuses on the interactions of the components when running the compiler. |
| Then we will review the logical modules of the codebase and their relationship. This part provides a static overarching view of the design. |
| |
| Overall Flow |
| ------------ |
| |
| In this guide, we will study an example compilation flow in the compiler. The figure below shows the flow. At a high-level, it contains several steps: |
| |
| - **Model Creation**: Create the IRModule to be optimized and compiled, which contains a collection of functions that internally represent the model. |
| Users can construct an IRModule manually via NNModule or TVMScript, or import a pre-trained model through the Relax frontend. |
| - **Transformation**: The compiler transforms an IRModule into another functionally equivalent or approximately |
| equivalent (e.g. in the case of quantization) IRModule. Many of the transformations are target (backend) independent. |
| We also allow the target to affect the configuration of the transformation pipeline. |
| - **Target Translation**: The compiler translates (codegen) the IRModule to an executable format specified by the target. |
| The target translation result is encapsulated as a `runtime.Module` that can be exported, loaded, and executed on the target runtime environment. |
| - **Runtime Execution**: The user loads back a `runtime.Module` and runs the compiled functions in the supported runtime environment. |
| |
| |
| .. figure:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/design/tvm_overall_flow.svg |
| :align: center |
| :width: 80% |
| |
| |
| Key data structures |
| ~~~~~~~~~~~~~~~~~~~ |
| |
| One of the best ways to design and understand a complex system is to identify the key data structures and the APIs that |
| manipulate (transform) these data structures. Once we have identified the key data structures, we can break down a system into logical |
| components that either define a collection of key data structures or transformations among the data structures. |
| |
| **IRModule** is the primary data structure used across the entire stack. An IRModule (intermediate representation module) |
| contains a collection of functions. Currently, we support two primary variants of functions. |
| |
| - **relax::Function** is a high-level functional program representation. A relax.Function represents a high-level graph structure and |
| usually corresponds to an end-to-end model or a sub-graph of the overall model. You can view a relax.Function as a computational |
| graph with additional support for control flow and complex data structures. |
| - **tir::PrimFunc** is a low-level program representation that contains elements including loop-nest choices, multi-dimensional load/store, |
| threading, and vector/tensor instructions. It is usually used to represent an operator program that executes a (possibly-fused) layer in a model. |
| |
| During compilation and transformation, all relax operators are lowered to ``tir::PrimFunc`` or ``TVM PackedFunc``, which can be executed directly |
| on the target device, while calls to relax operators are lowered to calls to these low-level functions (e.g. ``R.call_tir`` or ``R.call_dps``). |
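| |
| For instance, the snippet below is a minimal sketch, written in TVMScript, of an IRModule holding both function variants: a ``tir::PrimFunc`` implementing an element-wise kernel and a ``relax::Function`` calling it through ``R.call_tir``. The module, function, and buffer names are illustrative. |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm.script import ir as I, relax as R, tir as T |
| |
| @I.ir_module |
| class MyModule: |
|     # low-level tir.PrimFunc: explicit loop nest, block, and buffer accesses |
|     @T.prim_func |
|     def add_one(A: T.Buffer((8,), "float32"), B: T.Buffer((8,), "float32")): |
|         for i in range(8): |
|             with T.block("add"): |
|                 vi = T.axis.spatial(8, i) |
|                 B[vi] = A[vi] + T.float32(1) |
| |
|     # high-level relax.Function: a graph-level view that calls into the TIR kernel |
|     @R.function |
|     def main(x: R.Tensor((8,), "float32")) -> R.Tensor((8,), "float32"): |
|         cls = MyModule |
|         y = R.call_tir(cls.add_one, (x,), out_sinfo=R.Tensor((8,), "float32")) |
|         return y |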
| |
| Transformations |
| ~~~~~~~~~~~~~~~ |
| |
| Now that we have covered the key data structures, let us talk about the transformations. Each transformation could serve one of the following purposes: |
| |
| - optimization: transform a program to an equivalent, possibly more optimized version. |
| - lowering: transform a program to a lower-level representation that is closer to the target. |
| |
| relax transformations |
| ^^^^^^^^^^^^^^^^^^^^^ |
| relax transformations contain a collection of passes that apply to relax functions. The optimizations include common graph-level |
| optimizations, such as constant folding and dead-code elimination, and backend-specific optimizations, such as library dispatch. |
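| |
| As an illustration, the minimal sketch below composes two such passes into a pipeline and applies it to an existing IRModule ``mod`` (the exact set of available passes can vary between TVM versions): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import relax |
| |
| # compose graph-level optimizations into a single pipeline |
| seq = tvm.transform.Sequential( |
|     [ |
|         relax.transform.FoldConstant(), |
|         relax.transform.DeadCodeElimination(), |
|     ] |
| ) |
| # `mod` is assumed to be an existing IRModule containing relax functions |
| mod = seq(mod) |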
| |
| tir transformations |
| ^^^^^^^^^^^^^^^^^^^ |
| tir transformations contain a collection of passes that apply to tir functions. There are two major types of transformations: |
| |
| - **TensorIR schedule**: TensorIR schedules are designed to optimize a TensorIR function for a specific target, with user-guided instructions that control how the target code is generated. |
| For CPU targets, a TIR PrimFunc can generate valid code and execute on the target device without a schedule, albeit with very low performance. For GPU targets, however, a schedule is essential |
| for generating valid code with thread bindings. For more details, please refer to the :ref:`TensorIR Transformation <tir-transform>` section and the short scheduling sketch after this list. Additionally, we provide ``MetaSchedule`` to |
| automate the search for TensorIR schedules. |
| - **Lowering Passes**: These passes usually run after a schedule is applied, transforming a TIR PrimFunc into another functionally equivalent PrimFunc that is closer to the |
| target-specific representation. For example, there are passes to flatten multi-dimensional accesses into one-dimensional pointer accesses, to expand intrinsics into target-specific ones, |
| and to decorate the function entry to meet the runtime calling convention. |
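| |
| The sketch below illustrates the scheduling workflow; it assumes ``mod`` is an IRModule whose ``main`` PrimFunc contains a block named ``"add"`` (both names are illustrative): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import tir |
| |
| # create a schedule over an existing IRModule `mod` |
| sch = tir.Schedule(mod) |
| block = sch.get_block("add")        # hypothetical block name |
| (i,) = sch.get_loops(block) |
| # split the loop and vectorize the inner part |
| i_outer, i_inner = sch.split(i, factors=[None, 4]) |
| sch.vectorize(i_inner) |
| scheduled_mod = sch.mod             # the transformed IRModule |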
| |
| Many low-level optimizations can be handled in the target phase by LLVM, the CUDA C compiler, and other target compilers. As a result, we leave low-level optimizations such as register allocation |
| to the downstream compilers and only focus on optimizations that are not covered by them. |
| |
| cross-level transformations |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| Apache TVM adopts a unified strategy to optimize end-to-end models. As an IRModule includes both relax and tir functions, cross-level transformations mutate |
| the IRModule by applying different transformations to these two types of functions. |
| |
| For example, the ``relax.LegalizeOps`` pass mutates the IRModule by lowering relax operators, adding the corresponding TIR PrimFuncs to the IRModule, and replacing the relax operators |
| with calls to the lowered TIR PrimFuncs. Another example is the operator fusion pipeline in relax (including ``relax.FuseOps`` and ``relax.FuseTIR``), which fuses multiple consecutive tensor operations |
| into one. Unlike previous implementations, the relax fusion pipeline analyzes the patterns of TIR functions and detects the best fusion rules automatically, rather |
| than relying on human-defined operator fusion patterns. |
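| |
| Putting these together, the minimal sketch below runs a cross-level lowering pipeline over an existing relax IRModule ``mod`` (the pass list is illustrative): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import relax |
| |
| seq = tvm.transform.Sequential( |
|     [ |
|         relax.transform.LegalizeOps(),  # relax ops -> TIR PrimFuncs + call_tir |
|         relax.transform.FuseOps(),      # group fusable relax operations |
|         relax.transform.FuseTIR(),      # merge the grouped TIR PrimFuncs |
|     ] |
| ) |
| mod = seq(mod) |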
| |
| Target Translation |
| ~~~~~~~~~~~~~~~~~~ |
| |
| The target translation phase transforms an IRModule to the corresponding target executable format. |
| For backends such as x86 and ARM, we use the LLVM IRBuilder to build in-memory LLVM IR. |
| We can also generate source-level languages such as CUDA C and OpenCL. |
| Finally, we support direct translations of a Relax function (sub-graph) to specific targets via external code generators. |
| It is important that the final code generation phase is as lightweight as possible. The vast majority of transformations |
| and lowering should be performed before the target translation phase. |
| |
| We also provide a Target structure to specify the compilation target. |
| The transformations before the target translation phase can also be affected by the target — for example, |
| a target's vector length would change the vectorization behavior. |
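| |
| For example, the small sketch below creates a Target from a tag string and queries attributes that earlier passes may use (attribute availability depends on the target kind): |
| |
| .. code-block:: python |
| |
| import tvm |
| |
| target = tvm.target.Target("cuda") |
| print(target.kind.name)        # "cuda" |
| print(target.max_num_threads)  # bound used when binding GPU threads |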
| |
| |
| Runtime Execution |
| ~~~~~~~~~~~~~~~~~ |
| |
| The main goal of TVM's runtime is to provide a minimal API for loading and executing the compiled artifact from a language of the user's choice, including Python, C++, Rust, Go, Java, and JavaScript. The code snippet below shows such an example in Python: |
| |
| .. code-block:: python |
| |
| import tvm |
| # Example runtime execution program in Python, with types annotated |
| mod: tvm.runtime.Module = tvm.runtime.load_module("compiled_artifact.so") |
| arr: tvm.runtime.Tensor = tvm.runtime.tensor([1, 2, 3], device=tvm.cuda(0)) |
| fun: tvm.runtime.PackedFunc = mod["addone"] |
| fun(arr) |
| print(arr.numpy()) |
| |
| |
| :py:class:`tvm.runtime.Module` encapsulates the result of compilation. A runtime.Module contains a GetFunction method to obtain PackedFuncs by name. |
| |
| :py:class:`tvm.runtime.PackedFunc` is a type-erased function interface covering both compiler-generated and runtime built-in functions. A runtime.PackedFunc can take arguments and return values with the |
| following types: POD types (int, float), string, runtime.PackedFunc, runtime.Module, runtime.Tensor, and other sub-classes of runtime.Object. |
| |
| :py:class:`tvm.runtime.Module` and :py:class:`tvm.runtime.PackedFunc` are powerful mechanisms to modularize the runtime. For example, to get the above `addone` function on CUDA, we can use LLVM to generate the host-side code that computes the launching parameters (e.g. the size of the thread groups) and then call into another PackedFunc from a CUDAModule that is backed by the CUDA driver API. The same mechanism can be used for OpenCL kernels. |
| |
| The above example only deals with a simple `addone` function. The code snippet below gives an example of an end-to-end model execution using the same interface: |
| |
| .. code-block:: python |
| |
| import tvm |
| # Example runtime execution program in python, with types annotated |
| factory: tvm.runtime.Module = tvm.runtime.load_module("resnet18.so") |
| # Create a stateful graph execution module for resnet18 on cuda(0) |
| gmod: tvm.runtime.Module = factory["resnet18"](tvm.cuda(0)) |
| data: tvm.runtime.Tensor = get_input_data() |
| # set input |
| gmod["set_input"](0, data) |
| # execute the model |
| gmod["run"]() |
| # get the output |
| result = gmod["get_output"](0).numpy() |
| |
| The main takeaway is that runtime.Module and runtime.PackedFunc are sufficient to encapsulate both operator-level programs (such as addone) and end-to-end models. |
| |
| Summary and Discussions |
| ~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| In summary, the key data structures in the compilation flows are: |
| |
| - IRModule: contains relax.Function and tir.PrimFunc |
| - runtime.Module: contains runtime.PackedFunc |
| |
| Most parts of the compilation are transformations among the key data structures. |
| |
| - relax/transform and tir/transform are deterministic rule-based transformations |
| - meta-schedule contains the search-based transformations |
| |
| Finally, the compilation flow example is only a typical use-case of the TVM stack. |
| We expose these key data structures and transformations to the Python and C++ APIs. As a result, you can use TVM much like the way you use numpy, |
| except that the data structure of interest changes from numpy.ndarray to tvm.IRModule. Here are some example use-cases: |
| |
| - Directly construct IRModule using the python API. |
| - Compose a custom set of transformations (e.g. customized quantization). |
| - Manipulate the IR directly using TVM's python API. |
| |
| |
| tvm/support |
| ----------- |
| The support module contains the most common utilities for the infrastructure, such as a generic arena allocator, sockets, and logging. |
| |
| |
| tvm/runtime |
| ----------- |
| |
| The runtime serves as the foundation of the TVM stack. It provides the mechanism to load and execute compiled artifacts. |
| The runtime defines a stable standard set of C APIs to interface with frontend languages such as Python and Rust. |
| |
| `runtime::Object` is one of the primary data structures in the TVM runtime, besides `ffi::Function`. |
| It is a reference-counted base class with a type index to support runtime type checking and downcasting. |
| The object system allows the developer to introduce new data structures to the runtime, such as Array, Map, and new IR data structures. |
| |
| Besides deployment use-cases, the compiler itself also makes heavy use of TVM's runtime mechanism. |
| All of the IR data structures are subclasses of `runtime::Object`; as a result, they can be directly accessed and manipulated from the Python frontend. |
| We use the PackedFunc mechanism to expose various APIs to the frontend. |
| |
| Runtime support for different hardware backends is defined in subdirectories of runtime (e.g. runtime/opencl). |
| These hardware-specific runtime modules define APIs for device memory allocation and device function serialization. |
| |
| `runtime/rpc` implements RPC support for PackedFunc. We can use the RPC mechanism to send a cross-compiled library to a remote |
| device and benchmark the execution performance. The RPC infrastructure enables data collection from a wide range of hardware backends |
| for learning-based optimizations. |
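| |
| A typical interaction looks like the sketch below (the address, port, and file names are illustrative): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import rpc |
| |
| # connect to an RPC server running on the remote device |
| remote = rpc.connect("192.168.0.10", 9090) |
| # ship a cross-compiled library to the remote and load it there |
| remote.upload("compiled_artifact.so") |
| lib = remote.load_module("compiled_artifact.so") |
| # functions obtained from `lib` now execute on the remote device |
| dev = remote.cpu(0) |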
| |
| .. toctree:: |
| :maxdepth: 1 |
| |
| runtime |
| |
| .. toctree:: |
| :maxdepth: 1 |
| |
| introduction_to_module_serialization |
| device_target_interactions |
| |
| .. TODO(tvm-team) add a section about relax vm here |
| |
| tvm/node |
| -------- |
| The node module adds additional features on top of `runtime::Object` for IR data structures. |
| The main features include reflection, serialization, structural equivalence, and hashing. |
| |
| Thanks to the node module, we can directly access any field of TVM's IR nodes by name in Python. |
| |
| .. code-block:: python |
| |
| x = tvm.tir.Var("x", "int32") |
| y = tvm.tir.Add(x, x) |
| # a and b are fields of a tir.Add node |
| # we can directly use the field name to access the IR structures |
| assert y.a == x |
| |
| We can also serialize any IR node into a JSON format and load it back. |
| The ability to save, load, and inspect an IR node provides a foundation for making the compiler more accessible. |
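| |
| For example, the small sketch below round-trips an IR node through its JSON representation: |
| |
| .. code-block:: python |
| |
| import tvm |
| |
| x = tvm.tir.Var("x", "int32") |
| y = tvm.tir.Add(x, x) |
| # serialize the IR node to JSON and load it back |
| json_str = tvm.ir.save_json(y) |
| y2 = tvm.ir.load_json(json_str) |
| # the reloaded node is structurally equal to the original |
| assert tvm.ir.structural_equal(y, y2) |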
| |
| tvm/ir |
| ------ |
| The `tvm/ir` folder contains the unified data structure and interfaces across all IR function variants. |
| The components in `tvm/ir` are shared by `tvm/relax` and `tvm/tir`. Notable ones include: |
| |
| - IRModule |
| - Type |
| - PassContext and Pass |
| - Op |
| |
| Different variants of functions (e.g. relax.Function and tir.PrimFunc) can co-exist in an IRModule. |
| While these variants may not have the same content representation, they use the same data structure to represent types. |
| As a consequence, we use the same data structure to represent function (type) signatures of these variants. |
| The unified type system allows one function variant to call another function |
| once we clearly define the calling convention. This opens doors for future cross-function-variant optimizations. |
| |
| We also provide a unified PassContext for configuring the pass behavior, and common composite passes to execute a pass pipeline. |
| The following code snippet gives an example of PassContext configuration. |
| |
| .. code-block:: python |
| |
| # configure the behavior of the tir.UnrollLoop pass |
| with tvm.transform.PassContext(config={"tir.UnrollLoop": {"auto_max_step": 10}}): |
|     # passes invoked inside this block see the configured options, |
|     # e.g. applying the pass to an existing IRModule `mod` |
|     mod = tvm.tir.transform.UnrollLoop()(mod) |
| |
| |
| Op is the common class used to represent all system-defined primitive operators/intrinsics. |
| Developers can register new Ops as well as their additional attributes (e.g. whether the Op is elementwise) to the system. |
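| |
| For instance, the small sketch below looks up a registered operator and reads its registered information (the operator name assumes the relax op registry): |
| |
| .. code-block:: python |
| |
| import tvm |
| |
| add_op = tvm.ir.Op.get("relax.add") |
| print(add_op.name)        # "relax.add" |
| print(add_op.num_inputs)  # number of inputs registered for the op |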
| |
| .. toctree:: |
| :maxdepth: 1 |
| |
| pass_infra |
| |
| |
| tvm/target |
| ---------- |
| The target module contains all the code generators that translate an IRModule to a target runtime.Module. |
| It also provides a common `Target` class that describes the target. |
| |
| .. TODO(tvm-team) add a target json description example once the new target API stabilizes. |
| |
| |
| The compilation pipeline can be customized according to the target by querying the attribute information |
| in the target and the built-in information registered to each target kind (e.g. cuda, opencl). |
| |
| .. toctree:: |
| :maxdepth: 1 |
| |
| device_target_interactions |
| |
| tvm/relax |
| --------- |
| |
| Relax is the high-level IR used to represent the computational graph of a model. Various optimizations are defined in ``relax.transform``. |
| Note that Relax usually works closely with TensorIR; most of the transformations are applied to both Relax and TensorIR functions |
| in the IRModule. Please refer to the :ref:`Relax Deep Dive <relax-deep-dive>` for more details. |
| |
| tvm/tir |
| ------- |
| |
| TIR contains the definition of the low-level program representations. We use `tir::PrimFunc` to represent functions that can be transformed by TIR passes. |
| Besides the IR data structures, the tir module also includes: |
| |
| - A set of schedule primitives to control the generated code in ``tir/schedule``. |
| - A set of builtin intrinsics in ``tir/tensor_intrin``. |
| - A set of analysis passes to analyze the TIR functions in ``tir/analysis``. |
| - A set of transformation passes to lower or optimize the TIR functions in ``tir/transform``. |
| |
| Please refer to the :ref:`TensorIR Deep Dive <tensor-ir-deep-dive>` for more details. |
| |
| tvm/arith |
| --------- |
| |
| This module is closely tied to TIR. One of the key problems in low-level code generation is the analysis of the arithmetic |
| properties of indices: positiveness, variable bounds, and the integer sets that describe the iterator space. The arith module provides |
| a collection of tools that perform (primarily integer) analysis. A TIR pass can use these analyses to simplify and optimize the code. |
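| |
| For example, the minimal sketch below uses the analyzer to simplify an index expression once a variable bound is known: |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import arith, tir |
| |
| analyzer = arith.Analyzer() |
| x = tir.Var("x", "int32") |
| # declare that x ranges over [0, 16) |
| analyzer.bind(x, tvm.ir.Range(0, 16)) |
| # with the bound known, the modulo can be removed |
| print(analyzer.simplify(tir.floormod(x, 16)))  # prints: x |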
| |
| tvm/te and tvm/topi |
| ------------------- |
| |
| TE stands for Tensor Expression. TE is a domain-specific language (DSL) for describing tensor computations. Importantly, a tensor expression |
| itself is not a self-contained function that can be stored in an IRModule. We can use ``te.create_prim_func`` to convert a tensor expression into a ``tir::PrimFunc`` |
| and then integrate it into an IRModule, as the sketch below shows. |
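| |
| A minimal sketch of this workflow (the shapes and names are illustrative): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import te |
| |
| # describe C[i] = A[i] + B[i] as a tensor expression |
| A = te.placeholder((128,), dtype="float32", name="A") |
| B = te.placeholder((128,), dtype="float32", name="B") |
| C = te.compute((128,), lambda i: A[i] + B[i], name="C") |
| |
| # convert the expression into a tir.PrimFunc and put it into an IRModule |
| prim_func = te.create_prim_func([A, B, C]) |
| mod = tvm.IRModule({"add": prim_func}) |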
| |
| While it is possible to construct operators directly via TIR or tensor expressions (TE) for each use case, doing so is tedious. |
| `topi` (Tensor operator inventory) provides a set of pre-defined operators, defined in the style of numpy and found in common deep learning workloads. |
| |
| tvm/meta_schedule |
| ----------------- |
| |
| MetaSchedule is a system for automated search-based program optimization. It is designed to be a drop-in replacement for AutoTVM and AutoScheduler, |
| and can be used to optimize TensorIR schedules. Note that MetaSchedule only works with static-shape workloads. |
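| |
| A typical entry point looks like the sketch below (parameter names and the exact API can vary between TVM versions; the target string and trial budget are illustrative): |
| |
| .. code-block:: python |
| |
| from tvm import meta_schedule as ms |
| |
| # search for good TensorIR schedules for the PrimFuncs in `mod` |
| database = ms.tune_tir( |
|     mod=mod,                     # an existing IRModule with TIR PrimFuncs |
|     target="llvm -num-cores 4",  # illustrative target |
|     work_dir="./tune_tmp",       # stores tuning logs and the database |
|     max_trials_global=64,        # illustrative trial budget |
| ) |
| # retrieve the best schedule found during the search |
| sch = ms.tir_integration.compile_tir(database, mod, "llvm -num-cores 4") |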
| |
| tvm/dlight |
| ---------- |
| |
| DLight is a set of pre-defined, easy-to-use, and performant TIR schedules. DLight aims to: |
| |
| - Fully support **dynamic shape workloads**. |
| - Be **lightweight**: DLight provides tuning-free (or very-few-shot tuning) schedules with reasonable performance. |
| - Be **robust**: each DLight rule is designed to be robust and general-purpose, and if a rule is not applicable, |
| DLight does not raise an error but automatically switches to the next rule. |
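| |
| In practice, DLight rules are applied to all matching TIR functions in an IRModule, as in the minimal sketch below (the chosen rules and target are illustrative): |
| |
| .. code-block:: python |
| |
| import tvm |
| from tvm import dlight as dl |
| |
| # apply DLight GPU rules to every TIR PrimFunc in `mod`; each function |
| # is scheduled by the first rule that is applicable to it |
| with tvm.target.Target("cuda"): |
|     mod = dl.ApplyDefaultSchedule( |
|         dl.gpu.Matmul(),    # specialized rule for matmul-like workloads |
|         dl.gpu.Fallback(),  # general-purpose fallback rule |
|     )(mod) |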