docs/dev/virtual_machine.rst - tvm - Git at Google

 ..  Licensed to the Apache Software Foundation (ASF) under one
     or more contributor license agreements.  See the NOTICE file
     distributed with this work for additional information
     regarding copyright ownership.  The ASF licenses this file
     to you under the Apache License, Version 2.0 (the
     "License"); you may not use this file except in compliance
     with the License.  You may obtain a copy of the License at

 ..    http://www.apache.org/licenses/LICENSE-2.0

 ..  Unless required by applicable law or agreed to in writing,
     software distributed under the License is distributed on an
     "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
     KIND, either express or implied.  See the License for the
     specific language governing permissions and limitations
     under the License.

 Putting the VM in TVM: The Relay Virtual Machine
 ================================================

 Relay, a new program representation, has enabled the representation and optimization of
 a great breadth of machine learning programs.
 Unfortunately, by supporting a more expressive set of programs, we have
 introduced several new execution challenges.

 Relay's interpreter can execute the full language but has notable limitations
 that make it unsuited for production deployments. It is structured as an inefficient
 interpreter that performs AST traversal to execute the program. This approach is conceptually
 simple but inefficient, as the AST traversal heavily relies on indirection.

 There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
 fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
 for these, but none is sufficiently compelling or optimized.

 The second execution mechanism is the existing graph executor. In order to target Relay
 programs to this, we compile a small subset of them to the old graph format and execute
 them on the runtime. Graph executor provides a fast execution experience but only for a very limited
 subset of Relay programs.

 An alternative but not-standard approach is Relay's ahead-of-time compiler,
 which compiles a Relay program into a shared library containing an ahead-of-time
 implementation. The ahead-of-time compiler provides compelling performance
 but is difficult to extend and instrument, which can only be done by modifying the
 code generation and optimization mechanisms.

 The Relay virtual machine is intended to be a framework that balances these competing
 approaches, providing a dynamic execution environment which can be extended, instrumented,
 and integrated with other approaches like ahead-of-time compilation via a flexible extension
 mechanism.

 The virtual machine is designed to strike a balance between performance and flexibility
 when deploying and executing Relay programs, without giving up the benefits of TVM.

 Virtual machine (VM) design is a well-studied area in programming languages and systems,
 and there have been various virtual machine designs for both full-fledged
 and embedded programing languages.
 Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
 Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
 The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
 In the context of machine learning we manipulate primarily tensor values, using a (relatively)
 low number of high level instructions. ML programs' cost centers are expensive operator invocations,
 such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
 micro-optimizations present in scalar VMs are dramatically less important.

 TVM has provided strong support for vision models,
 but we want to grow to support a wider variety of models.
 The graph executor is able to utilize the fully static nature of the input graphs to perform
 aggressive optimization such as fully static allocation, and optimal memory reuse.
 When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
 allocation, we must change how execution works. A virtual machine for Relay is a natural choice.

 The rest of this document provides a high-level overview of the Relay
 virtual machine design and its instruction set.

 Design
 ------

 The VM's design is focused on simplicity without sacrificing performance.
 In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM.

 In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
 reuse of static fragments, and the ability to do dynamic shape (i.e jagged tensors).

 Instruction Set
 ~~~~~~~~~~~~~~~

 The choices of an instruction set and instruction representation are the most critical design decisions for a VM.
 The current representation of the instructions is a tagged union containing the op-code and the data payload.  An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay.

 Ret
 ^^^
 **Arguments**:
 ::

   RegName dst
   RegName result

 Returns the object in register ``result`` to caller's register ``dst``.

 InvokePacked
 ^^^^^^^^^^^^
 **Arguments**:
 ::

   Index packed_index
   Index arity
   Index output_size
   RegName* packed_args

 Invoke the packed function denoted by ``packed_index``. The ``arity``
 and ``output_size`` are used to inform the VM how many inputs and
 outputs to expect. ``packed_args`` stores the list of argument registers. Note ``Index``
 is an alias of ``int64_t``, and it will be used in other instructions as well.

 AllocTensor
 ^^^^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   RegName storage
   uint32_t ndim
   int64_t* shape
   DLDataType dtype

 Allocate a tensor value of using constant shape (stored in ``shape``) and ``dtype``
 from the given storage block, ``storage``. The result is saved to register ``dst``.

 AllocTensorReg
 ^^^^^^^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   RegName storage
   RegName shape_register
   DLDataType dtype

 Allocate a tensor value of the appropriate shape (stored in ``shape_register``)
 and ``dtype`` from the given storage block (stored in ``storage``). The result is saved to register ``dst``.

 AllocStorage
 ^^^^^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   RegName size
   RegName alignment
   DLDataType dtype_hint

 Allocate a storage block with the given ``size``, ``alignment`` and data type, ``dtype_hint``.
 The allocated storage block is stored in register ``dst``.

 AllocADT
 ^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   Index tag
   Index num_fields
   RegName* datatype_fields

 Allocate a data type with the tag ``tag`` using the ``num_fields`` entries
 from registers ``datatype_fields``. The result is saved to register ``dst``.

 AllocClosure
 ^^^^^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   Index clo_index
   Index num_freevar
   RegName* free_vars;

 Allocate a closure with the VMFunction at ``clo_index`` as
 its code, and the ``num_freevar`` entries from registers in
 ``free_vars``. The result is saved to register ``dst``.

 GetField
 ^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   RegName object
   Index field_index

 Get the field value with index ``field_index`` from ``object``. And saves the result to register ``dst``.

 If
 ^^
 **Arguments**:
 ::

   RegName test
   RegName target
   Index true_offset
   Index false_offset

 Check if the object at register ``test`` is equal to ``target``.
 If equal, relative jump by ``true_offset``, else relative
 jump by ``false_offset``.

 GetTag
 ^^^^^^
 **Arguments**:
 ::

   RegName object
   RegName dst

 Get the object tag for ADT object in register ``object``. And saves the reult to register ``dst``.

 Fatal
 ^^^^^
 Fail the virtual machine execution.

 Goto
 ^^^^
 **Arguments**:
 ::

   Index pc_offset

 Relative unconditional jump by ``pc_offset``.

 Invoke
 ^^^^^^
 **Arguments**:
 ::

   Index func_index

 Invoke function at ``func_index``, consumes the number of arguments contained in the VMFunction's
 arity field.

 InvokeClosure
 ^^^^^^^^^^^^^
 **Arguments**:
 ::

     RegName closure
     Index num_closure_args
     RegName* closure_args

 Invokes ``closure``, consuming the number of arguments declared in the closure's VMFunction.

 LoadConst
 ^^^^^^^^^
 **Arguments**:
 ::

   RegName dst
   Index const_index

 Load the constant at ``const_index`` from the constant pool. The result is saved to register ``dst``.

 LoadConsti
 ^^^^^^^^^^
 **Arguments**:
 ::

   Index val
   RegName dst

 Load the constant integer ``val`` to register ``dst``. The result is a 0-rank tensor.

 Object Representation
 ~~~~~~~~~~~~~~~~~~~~~
 We leverage the object protocol to represent the objects that are used by the
 VM.

 Currently, three types of objects, ``NDArray``, ``ADT``, and ``Closure`` objects, are used
 to represent tensor, tuple/list, and closure data, respectively. More details
 for each of them can be found at `include/tvm/runtime/ndarray.h`_,
 `include/tvm/runtime/vm/vm.h`_, and `include/tvm/runtime/container.h`_, respectively.

 .. _include/tvm/runtime/ndarray.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h

 .. _include/tvm/runtime/vm/vm.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/vm/vm.h

 .. _include/tvm/runtime/container.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/container.h

 Stack and State
 ~~~~~~~~~~~~~~~

 The Relay VM maintains a stack frame, which contains information about how to resume the
 previous call. Registers are allocated in a continuous space (virtual register file) for each function.

 We keep track of a set of Relay functions we have called, a pointer into its bytecode, an offset into the byte code (known as the program counter).

 .. code-block:: c

     struct VirtualMachine {
       ...
       std::vector<VMFrame> frames;
       ...
       // Current function.
       size_t func_index;
       // Pointer into the current function's instructions.
       const Instruction* code;
       // Current program counter relative to the code pointer.
       size_t pc;
       ...
     };


 Dispatch Loop
 ~~~~~~~~~~~~~
 A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
 virtual machine, but we have experimentally found this not to be the case for Relay. We have just implemented
 a simple ``switch``/``goto`` dispatch loop which dispatches based on instruction op code.

 This loop is implemented by ``VirtualMachine::Run()``.

 VM Compiler
 ~~~~~~~~~~~

 An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
 The VM compiler transforms a ``tvm::relay::Module`` into a ``tvm::relay::vm::Executable``. The executable
 contains a set of compiled functions, the compiled functions are contained in ``tvm::relay::vm::Function``.
 The functions contain metadata about the function as well as its compiled bytecode. The emitted executable
 object then can be loaded and run by a ``tvm::relay::vm::VirtualMachine`` object. For full definitions of the
 data structures, please see `include/tvm/runtime/vm/executable.h`_ and `include/tvm/runtime/vm/vm.h`_.

 .. _include/tvm/runtime/vm/executable.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/vm/executable.h

 Optimizations
 ~~~~~~~~~~~~~

 There are quite a few optimizations required by the VM compiler. Each of them
 is implemented as a pass which is managed by the Relay pass manager.

 Optimizations marked with `TODO` are not implemented yet.

 - A-Normal Form
 - Lambda Lift (see `src/relay/vm/lambda_lift.cc`_)
 - Inline Primitives (see `src/relay/vm/inline_primitives.cc`_)
 - Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`_)
 - Tail Call Optimization (TODO)
 - Liveness Analysis (TODO)

 .. _src/relay/vm/lambda_lift.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/lambda_lift.cc

 .. _src/relay/vm/inline_primitives.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/inline_primitives.cc

 .. _src/relay/backend/vm/compiler.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/compiler.cc

 Serialization
 ~~~~~~~~~~~~~

 Serializing and deserializing the executable generated by the Relay VM compiler is a must as
 we may want to save the model to the disk and perform inference later. Previously, Relay has produced
 a serialized form in a json file for the graph executor. However, the same format is not directly
 applicable to the VM as it emits bytecode instead of graph-style programs.
 Serialization of an executable essentially needs to handle both model specific
 (i.e. weights and kernels) and VM related (i.e. bytecode and global function names) data.

 For kernels, we can conveniently leverage existing TVM infra to save and load
 the compiled library module. Here we only focus on serializing other several
 components in a binary format that is organized with the following sections in order.

 - Global section. This section contains the globals (function names) used by the virtual machine.

 - Constant section. This section is used to store the constant pool (i.e. weights of the model)
   for a virtual machine.

 - Primitive name section. This section is introduced to accommodate the list of primitive
   operator names that will be invoked by the virtual machine, i.e. the names
   starting with ``fused_``. The primitive names are used as symbols to look up
   function pointers in the compiled kernel library.

 - Code section. The VM functions, including bytecode, are sitting in this section. The dispatching
   loop iterates through this section to fetch instructions for execution.

 Hence, unlike the graph executor artifact that contains weight (.params), graph json (.json),
 and compiled kernel library (.so), the serialized executable artifact is composed of the Relay
 object file (.ro) and the compiled kernel library (.so).

 A ``save`` function is implemented to store the executable to the disk and
 serialize it into the above format. Meanwhile, a ``load_exec`` function is used to
 load the serialized kernel binary and executable related binary code, which will be again used to
 instantiate a VM object. Please refer to the `test_vm_serialization.py`_ file for more
 examples.

 .. _test_vm_serialization.py: https://github.com/apache/tvm/blob/main/tests/python/relay/test_vm_serialization.py

 Unresolved Questions
 ~~~~~~~~~~~~~~~~~~~~

 How do we handle dynamic shapes?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 TODO

 How can we modify the VM to support JIT compilation of certain code paths?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 In the code generation space there are still many tradeoffs to be analyzed and the VM is designed
 to be very flexible so we can modify it for future experiments.

 How do we support heterogenous execution?
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

 Heterogenous execution should work out of the box assuming we have annotated the appropriate device copies.
 In order to do this properly we need to run the device annotation and copying passes.
	.. Licensed to the Apache Software Foundation (ASF) under one
	or more contributor license agreements. See the NOTICE file
	distributed with this work for additional information
	regarding copyright ownership. The ASF licenses this file
	to you under the Apache License, Version 2.0 (the
	"License"); you may not use this file except in compliance
	with the License. You may obtain a copy of the License at

	.. http://www.apache.org/licenses/LICENSE-2.0

	.. Unless required by applicable law or agreed to in writing,
	software distributed under the License is distributed on an
	"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
	KIND, either express or implied. See the License for the
	specific language governing permissions and limitations
	under the License.

	Putting the VM in TVM: The Relay Virtual Machine
	================================================

	Relay, a new program representation, has enabled the representation and optimization of
	a great breadth of machine learning programs.
	Unfortunately, by supporting a more expressive set of programs, we have
	introduced several new execution challenges.

	Relay's interpreter can execute the full language but has notable limitations
	that make it unsuited for production deployments. It is structured as an inefficient
	interpreter that performs AST traversal to execute the program. This approach is conceptually
	simple but inefficient, as the AST traversal heavily relies on indirection.

	There are further challenges in compiling dynamic code, such as dynamic scheduling and allocation,
	fully dynamic tensor shapes, and control flow. The interpreter offers simple solutions
	for these, but none is sufficiently compelling or optimized.

	The second execution mechanism is the existing graph executor. In order to target Relay
	programs to this, we compile a small subset of them to the old graph format and execute
	them on the runtime. Graph executor provides a fast execution experience but only for a very limited
	subset of Relay programs.

	An alternative but not-standard approach is Relay's ahead-of-time compiler,
	which compiles a Relay program into a shared library containing an ahead-of-time
	implementation. The ahead-of-time compiler provides compelling performance
	but is difficult to extend and instrument, which can only be done by modifying the
	code generation and optimization mechanisms.

	The Relay virtual machine is intended to be a framework that balances these competing
	approaches, providing a dynamic execution environment which can be extended, instrumented,
	and integrated with other approaches like ahead-of-time compilation via a flexible extension
	mechanism.

	The virtual machine is designed to strike a balance between performance and flexibility
	when deploying and executing Relay programs, without giving up the benefits of TVM.

	Virtual machine (VM) design is a well-studied area in programming languages and systems,
	and there have been various virtual machine designs for both full-fledged
	and embedded programing languages.
	Previous language VM designs have been heavily tailored to the execution profile of traditional programs.
	Traditional programs manipulate small scalar values and consist of a large number of low-level instructions.
	The sheer quantity of instructions requires instruction execution and dispatch to be extremely efficient.
	In the context of machine learning we manipulate primarily tensor values, using a (relatively)
	low number of high level instructions. ML programs' cost centers are expensive operator invocations,
	such as GEMM or convolution, over a large input. Due to the execution profile exhibited by ML programs,
	micro-optimizations present in scalar VMs are dramatically less important.

	TVM has provided strong support for vision models,
	but we want to grow to support a wider variety of models.
	The graph executor is able to utilize the fully static nature of the input graphs to perform
	aggressive optimization such as fully static allocation, and optimal memory reuse.
	When we introduce models which make use of control flow, recursion, dynamic shapes, and dynamic
	allocation, we must change how execution works. A virtual machine for Relay is a natural choice.

	The rest of this document provides a high-level overview of the Relay
	virtual machine design and its instruction set.

	Design
	------

	The VM's design is focused on simplicity without sacrificing performance.
	In order to accomplish this we have focused on designing a tensor VM rather than a scalar VM.

	In the tensor VM setting, we optimize for cheap “allocation” of objects (by trying to avoid real allocation),
	reuse of static fragments, and the ability to do dynamic shape (i.e jagged tensors).

	Instruction Set
	~~~~~~~~~~~~~~~

	The choices of an instruction set and instruction representation are the most critical design decisions for a VM.
	The current representation of the instructions is a tagged union containing the op-code and the data payload. An important design decision is the level of abstraction of the instructions (RISC vs. CISC) and how they take their data (fixed-width instruction encoding vs. variable-length encoding). The current version is closer to CISC, with complex instructions like AllocTensor, and is variable-length due to the inclusion of the shape as part of the instruction. The current instruction set is very high-level and corresponds roughly to high-level operations in Relay.

	Ret
	^^^
	Arguments:
	::

	RegName dst
	RegName result

	Returns the object in register ``result`` to caller's register ``dst``.

	InvokePacked
	^^^^^^^^^^^^
	Arguments:
	::

	Index packed_index
	Index arity
	Index output_size
	RegName* packed_args

	Invoke the packed function denoted by ``packed_index``. The ``arity``
	and ``output_size`` are used to inform the VM how many inputs and
	outputs to expect. ``packed_args`` stores the list of argument registers. Note ``Index``
	is an alias of ``int64_t``, and it will be used in other instructions as well.

	AllocTensor
	^^^^^^^^^^^
	Arguments:
	::

	RegName dst
	RegName storage
	uint32_t ndim
	int64_t* shape
	DLDataType dtype

	Allocate a tensor value of using constant shape (stored in ``shape``) and ``dtype``
	from the given storage block, ``storage``. The result is saved to register ``dst``.

	AllocTensorReg
	^^^^^^^^^^^^^^
	Arguments:
	::

	RegName dst
	RegName storage
	RegName shape_register
	DLDataType dtype

	Allocate a tensor value of the appropriate shape (stored in ``shape_register``)
	and ``dtype`` from the given storage block (stored in ``storage``). The result is saved to register ``dst``.

	AllocStorage
	^^^^^^^^^^^^
	Arguments:
	::

	RegName dst
	RegName size
	RegName alignment
	DLDataType dtype_hint

	Allocate a storage block with the given ``size``, ``alignment`` and data type, ``dtype_hint``.
	The allocated storage block is stored in register ``dst``.

	AllocADT
	^^^^^^^^
	Arguments:
	::

	RegName dst
	Index tag
	Index num_fields
	RegName* datatype_fields

	Allocate a data type with the tag ``tag`` using the ``num_fields`` entries
	from registers ``datatype_fields``. The result is saved to register ``dst``.

	AllocClosure
	^^^^^^^^^^^^
	Arguments:
	::

	RegName dst
	Index clo_index
	Index num_freevar
	RegName* free_vars;

	Allocate a closure with the VMFunction at ``clo_index`` as
	its code, and the ``num_freevar`` entries from registers in
	``free_vars``. The result is saved to register ``dst``.

	GetField
	^^^^^^^^
	Arguments:
	::

	RegName dst
	RegName object
	Index field_index

	Get the field value with index ``field_index`` from ``object``. And saves the result to register ``dst``.

	If
	^^
	Arguments:
	::

	RegName test
	RegName target
	Index true_offset
	Index false_offset

	Check if the object at register ``test`` is equal to ``target``.
	If equal, relative jump by ``true_offset``, else relative
	jump by ``false_offset``.

	GetTag
	^^^^^^
	Arguments:
	::

	RegName object
	RegName dst

	Get the object tag for ADT object in register ``object``. And saves the reult to register ``dst``.

	Fatal
	^^^^^
	Fail the virtual machine execution.

	Goto
	^^^^
	Arguments:
	::

	Index pc_offset

	Relative unconditional jump by ``pc_offset``.

	Invoke
	^^^^^^
	Arguments:
	::

	Index func_index

	Invoke function at ``func_index``, consumes the number of arguments contained in the VMFunction's
	arity field.

	InvokeClosure
	^^^^^^^^^^^^^
	Arguments:
	::

	RegName closure
	Index num_closure_args
	RegName* closure_args

	Invokes ``closure``, consuming the number of arguments declared in the closure's VMFunction.

	LoadConst
	^^^^^^^^^
	Arguments:
	::

	RegName dst
	Index const_index

	Load the constant at ``const_index`` from the constant pool. The result is saved to register ``dst``.

	LoadConsti
	^^^^^^^^^^
	Arguments:
	::

	Index val
	RegName dst

	Load the constant integer ``val`` to register ``dst``. The result is a 0-rank tensor.

	Object Representation
	~~~~~~~~~~~~~~~~~~~~~
	We leverage the object protocol to represent the objects that are used by the
	VM.

	Currently, three types of objects, ``NDArray``, ``ADT``, and ``Closure`` objects, are used
	to represent tensor, tuple/list, and closure data, respectively. More details
	for each of them can be found at `include/tvm/runtime/ndarray.h`_,
	`include/tvm/runtime/vm/vm.h`_, and `include/tvm/runtime/container.h`_, respectively.

	.. _include/tvm/runtime/ndarray.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/ndarray.h

	.. _include/tvm/runtime/vm/vm.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/vm/vm.h

	.. _include/tvm/runtime/container.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/container.h

	Stack and State
	~~~~~~~~~~~~~~~

	The Relay VM maintains a stack frame, which contains information about how to resume the
	previous call. Registers are allocated in a continuous space (virtual register file) for each function.

	We keep track of a set of Relay functions we have called, a pointer into its bytecode, an offset into the byte code (known as the program counter).

	.. code-block:: c

	struct VirtualMachine {
	...
	std::vector<VMFrame> frames;
	...
	// Current function.
	size_t func_index;
	// Pointer into the current function's instructions.
	const Instruction* code;
	// Current program counter relative to the code pointer.
	size_t pc;
	...
	};


	Dispatch Loop
	~~~~~~~~~~~~~
	A critical piece of a VM is the dispatch loop. The dispatch loop usually dominates the execution time of a
	virtual machine, but we have experimentally found this not to be the case for Relay. We have just implemented
	a simple ``switch``/``goto`` dispatch loop which dispatches based on instruction op code.

	This loop is implemented by ``VirtualMachine::Run()``.

	VM Compiler
	~~~~~~~~~~~

	An important part of this infrastructure is a compiler from Relay's full IR into a sequence of bytecode.
	The VM compiler transforms a ``tvm::relay::Module`` into a ``tvm::relay::vm::Executable``. The executable
	contains a set of compiled functions, the compiled functions are contained in ``tvm::relay::vm::Function``.
	The functions contain metadata about the function as well as its compiled bytecode. The emitted executable
	object then can be loaded and run by a ``tvm::relay::vm::VirtualMachine`` object. For full definitions of the
	data structures, please see `include/tvm/runtime/vm/executable.h`_ and `include/tvm/runtime/vm/vm.h`_.

	.. _include/tvm/runtime/vm/executable.h: https://github.com/apache/tvm/blob/main/include/tvm/runtime/vm/executable.h

	Optimizations
	~~~~~~~~~~~~~

	There are quite a few optimizations required by the VM compiler. Each of them
	is implemented as a pass which is managed by the Relay pass manager.

	Optimizations marked with `TODO` are not implemented yet.

	- A-Normal Form
	- Lambda Lift (see `src/relay/vm/lambda_lift.cc`_)
	- Inline Primitives (see `src/relay/vm/inline_primitives.cc`_)
	- Constant Pool Layout (see `src/relay/backend/vm/compiler.cc`_)
	- Tail Call Optimization (TODO)
	- Liveness Analysis (TODO)

	.. _src/relay/vm/lambda_lift.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/lambda_lift.cc

	.. _src/relay/vm/inline_primitives.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/inline_primitives.cc

	.. _src/relay/backend/vm/compiler.cc: https://github.com/apache/tvm/blob/main/src/relay/backend/vm/compiler.cc

	Serialization
	~~~~~~~~~~~~~

	Serializing and deserializing the executable generated by the Relay VM compiler is a must as
	we may want to save the model to the disk and perform inference later. Previously, Relay has produced
	a serialized form in a json file for the graph executor. However, the same format is not directly
	applicable to the VM as it emits bytecode instead of graph-style programs.
	Serialization of an executable essentially needs to handle both model specific
	(i.e. weights and kernels) and VM related (i.e. bytecode and global function names) data.

	For kernels, we can conveniently leverage existing TVM infra to save and load
	the compiled library module. Here we only focus on serializing other several
	components in a binary format that is organized with the following sections in order.

	- Global section. This section contains the globals (function names) used by the virtual machine.

	- Constant section. This section is used to store the constant pool (i.e. weights of the model)
	for a virtual machine.

	- Primitive name section. This section is introduced to accommodate the list of primitive
	operator names that will be invoked by the virtual machine, i.e. the names
	starting with ``fused_``. The primitive names are used as symbols to look up
	function pointers in the compiled kernel library.

	- Code section. The VM functions, including bytecode, are sitting in this section. The dispatching
	loop iterates through this section to fetch instructions for execution.

	Hence, unlike the graph executor artifact that contains weight (.params), graph json (.json),
	and compiled kernel library (.so), the serialized executable artifact is composed of the Relay
	object file (.ro) and the compiled kernel library (.so).

	A ``save`` function is implemented to store the executable to the disk and
	serialize it into the above format. Meanwhile, a ``load_exec`` function is used to
	load the serialized kernel binary and executable related binary code, which will be again used to
	instantiate a VM object. Please refer to the `test_vm_serialization.py`_ file for more
	examples.

	.. _test_vm_serialization.py: https://github.com/apache/tvm/blob/main/tests/python/relay/test_vm_serialization.py

	Unresolved Questions
	~~~~~~~~~~~~~~~~~~~~

	How do we handle dynamic shapes?
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	TODO

	How can we modify the VM to support JIT compilation of certain code paths?
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	In the code generation space there are still many tradeoffs to be analyzed and the VM is designed
	to be very flexible so we can modify it for future experiments.

	How do we support heterogenous execution?
	^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

	Heterogenous execution should work out of the box assuming we have annotated the appropriate device copies.
	In order to do this properly we need to run the device annotation and copying passes.