.. Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
distributed with this work for additional information
regarding copyright ownership. The ASF licenses this file
to you under the Apache License, Version 2.0 (the
"License"); you may not use this file except in compliance
with the License. You may obtain a copy of the License at
.. http://www.apache.org/licenses/LICENSE-2.0
.. Unless required by applicable law or agreed to in writing,
software distributed under the License is distributed on an
"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
KIND, either express or implied. See the License for the
specific language governing permissions and limitations
under the License.
**************************
microTVM Design Document
**************************
.. contents:: Table of Contents
:depth: 3
Background
===========
TVM is a model deployment framework that has demonstrated good performance across a wide range of
models on traditional operating systems. Given TVM's layered approach to compilation, it is a
natural extension to target bare metal devices. While most of the compilation flow does not need to
change for a proof-of-concept implementation on such devices, the runtime cannot depend on:
* **Virtual Memory**, and by extension any system-provided ``malloc``. Additionally, bare metal
devices typically have very limited memory (measured in KB). Because of this, libraries designed
for such platforms typically need to be more judicious in using memory, and need to release
memory when it is not in use.
* Traditional OS abstractions, such as **files**, **libraries**, and **kernel functions**. Some
projects implement support for these, but they are by no means standard.
* Support for programming languages other than **C**.
These constraints require a different approach from the TVM C++ runtime typically used on traditional
operating systems.
Typical Use
===========
This section discusses our vision of the "typical" microTVM use case. Each component used to achieve
this typical use case is intended to be designed for flexibility, but this unifying vision serves to
motivate the inclusion of each part of the design.
.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_workflow.svg
:align: center
:width: 85%
The parts of this process are described below:
#. **Model Import**. The user imports an existing model or describes a new model to TVM, producing a
*Relay module*.
#. **Model Transformations**. The user can apply transformations, such as quantization, to the
model. After each transformation, the user should still have a Relay module.
#. **Compilation** (Scheduling and Code Generation). TVM lowers each Relay operator to Tensor IR by
assigning it a schedule and schedule configuration. Then, code (C source or a compiled object) is
generated for each operator. Steps 1-3 are sketched in the example following this list.
#. **Integration**. The generated code is integrated along with the TVM C Runtime library into a
user-supplied binary project. In some cases (such as when the project is standardized across
multiple SoC/development boards), this process is handled automatically.
#. **Deployment**. The project is built and the resulting firmware binary is flashed onto the device.
Model inference is driven either from the host by TVM using the on-device RPC server, or directly on
the device using the on-device Graph Executor.
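As a concrete illustration, steps 1-3 above might look like the following sketch. The model file
name, input name, shapes, and quantization step are illustrative assumptions, and the target string
mirrors the example discussed later in this document (its flags are accepted as target attributes in
the TVM version this document describes).

.. code-block:: python

    import tflite
    import tvm
    from tvm import relay

    # Step 1: Model Import -- produce a Relay module ("model.tflite", the input
    # name, and its shape/dtype are hypothetical).
    with open("model.tflite", "rb") as f:
        tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)
    mod, params = relay.frontend.from_tflite(
        tflite_model, shape_dict={"input": (1, 32, 32, 3)}, dtype_dict={"input": "float32"}
    )

    # Step 2: Model Transformations -- e.g. quantization; the result is still a Relay module.
    mod = relay.quantize.quantize(mod, params)

    # Step 3: Compilation -- schedule each operator and generate C source for a bare-metal target.
    target = tvm.target.Target(
        "c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx -runtime=c -system-lib=1"
    )
    with tvm.transform.PassContext(opt_level=3):
        lowered = relay.build(mod, target=target, params=params)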
Design Goals
============
microTVM aims to achieve these design goals:
1. **Portable Code**. microTVM can translate any Relay model into C code that can compile with only
a C standard library.
2. **Minimal Overhead**. microTVM generates target-specific, highly optimized code. As much runtime
overhead as possible should be removed.
3. **Accessible Code**. microTVM considers C source code as a first-class output mechanism so that
it is easier for a firmware engineer to understand and tweak.
Overview
========
microTVM requires changes at all levels of the TVM compiler stack. The following sub-sections enumerate
these changes at a high level, and follow-on sections discuss the specifics in more detail.
Modeling Target Platforms
-------------------------
TVM's search-based optimization approach allows it to largely avoid system-level modeling of targets
in favor of experimental results. However, some modeling is necessary in order to ensure TVM is
comparing apples-to-apples search results, and to avoid wasting time during the search by attempting
to compile invalid code for a target.
microTVM models these parts of the target:
* The CPU used, through the ``-mcpu`` and ``-march`` target flags.
* The presence or absence of accelerators, through the device components of the target (currently
only the absence of accelerators can be expressed, but this mechanism should extend well).
microTVM aims to model these parts of the target in the future:
* Memory, modeled as a set of disjoint memory spaces, each with a label and size and prefetch/flush
behavior. Some memory may be shared with accelerators.
* Target runtime configuration (i.e. clock tree configuration, clock speed, etc). This is intended
only to contribute to the AutoTVM schedule key and not for any other use.
At this time, TVM does not intend to model:
* Size, type, or relationship of caches, with the exception of prefetching or cache flushing.
TVM Targets for microTVM
-------------------------
A central data structure in the compilation process is the ``tvm::target::Target`` class. TVM uses
Target to decide which TIR schedules to enable and how to configure the code generator. The Target
class should also uniquely identify the generated code for a particular operator, as autotuning
logs use it to rank measured performance (but see Future Work).
Targets are currently represented as strings structured similarly to command-line arguments. An
example target is shown below:
``c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx -runtime=c -system-lib=1``
The parts relevant to microTVM are:
* Code generator (``llvm`` or ``c``)
* ``-mcpu=cortex-m7``: used by TOPI to enable Cortex-M schedules, and, when the C source code
generator is selected, included in the output as a comment to help identify the code and
configure the downstream C compiler.
* ``-link-params``: include parameters as global constants to load from flash.
* ``-runtime=c``: build glue code to allow operators to work with the C runtime.
* ``-system-lib=1``: emit a system library (i.e. one that can be loaded by calling the PackedFunc
``runtime.SystemLib``).
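For reference, such a string parses into a ``Target`` object whose pieces can be inspected from
Python. A minimal sketch, using only the subset of flags that parses across TVM versions (flags such
as ``-runtime``, ``-link-params``, and ``-system-lib`` are accepted as target attributes only in the
TVM version this document describes):

.. code-block:: python

    import tvm

    # Parse a reduced microTVM target string and inspect the parts listed above.
    target = tvm.target.Target("c -keys=arm_cpu -mcpu=cortex-m7 -model=stm32f746xx")
    print(target.kind.name)       # "c": the C source code generator
    print(list(target.keys))      # includes "arm_cpu": enables the Cortex-M TOPI schedules
    print(target.attrs["mcpu"])   # "cortex-m7"
    print(target.attrs["model"])  # "stm32f746xx"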
Writing Schedules for microTVM
------------------------------
For operations scheduled on the CPU, microTVM initially plans to make use of specialized
instructions and extern (i.e. hand-optimized) functions to achieve good performance. In TVM, this
approach is generally accomplished through tensorization, in which TVM breaks a computation into
small pieces, and a TIR extern function accelerates each small piece.
TVM currently accommodates both approaches using ``tir.call_extern``. First, a pragma is attached to
the schedule defining the extern function in portable C.
``sched[output].pragma(n, "import_c", "void call_asm(int32_t* a, int32_t* b) { /* ... */ }")``
Next, ``tensorize`` is used to split the computation.
``sched[output].tensorize(owi, gemm)``
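The snippet below is a minimal, illustrative sketch (not code taken from TVM) of how such an extern
function can be wrapped in a tensor intrinsic for use with ``tensorize``; the extern symbol
``gemm_update`` and the ``int32`` GEMM shapes are hypothetical.

.. code-block:: python

    import tvm
    from tvm import te

    def intrin_gemm(M, N, K):
        """Declare a tensor intrinsic whose body calls a hand-written extern function."""
        a = te.placeholder((M, K), name="a", dtype="int32")
        b = te.placeholder((K, N), name="b", dtype="int32")
        k = te.reduce_axis((0, K), name="k")
        c = te.compute((M, N), lambda i, j: te.sum(a[i, k] * b[k, j], axis=k), name="c")

        def intrin_func(ins, outs):
            aa, bb = ins
            cc = outs[0]
            ib = tvm.tir.ir_builder.create()
            # Replace the computation with a call to the extern function defined via
            # the "import_c" pragma (the symbol "gemm_update" is hypothetical).
            ib.emit(
                tvm.tir.call_extern(
                    "int32",
                    "gemm_update",
                    cc.access_ptr("w"),
                    aa.access_ptr("r"),
                    bb.access_ptr("r"),
                    M,
                    N,
                    K,
                )
            )
            return ib.get()

        return te.decl_tensor_intrin(c.op, intrin_func, default_buffer_params={"offset_factor": 1})

A schedule could then apply it with something like ``sched[output].tensorize(owi, intrin_gemm(2, 2, K))``,
provided the tensorized loop extents match the intrinsic's shapes.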
There are a couple of caveats to this approach, all of which could be resolved by linking generated
code against external libraries:
* Inline assembly is compiler-specific. While Clang and GCC have standardized on one syntax, this
may not be portable to other compilers. SDKs solve this by conditionally including a header file
depending on the compiler being used. However, taking this approach means that the generated code
needs additional compiler flags (i.e. ``-Isystempath/to/header``).
* It may be helpful to reference helper functions from the generated code (e.g. to inline common
sequences of hand-optimized assembly).
* Finally, the extern function invoked may be wholly written in an external library. If those
functions can be wholly inlined, this caveat is the same as the previous. If not, then additional
C code needs to be compiled and linked against the operator.
At present, microTVM presumes that all eligible schedules can be compiled. This means that the
user-supplied project (see next section) must include all libraries that are used by the generated code.
When not using autotuning, TVM randomly chooses a fallback schedule, so all libraries would need to
be supported. When using autotuning, TVM selects the best-performing schedule, so only that library
is needed. There isn't currently a way to force TVM to pick a particular schedule outside of
autotuning logs, but that would be a good addition.
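For example, applying a tuning log at build time is what pins the chosen schedule (and therefore the
set of extern libraries that must be present in the project); the log file name below is
hypothetical, and ``mod``, ``target``, and ``params`` are assumed from the earlier compilation
sketch.

.. code-block:: python

    import tvm
    from tvm import autotvm, relay

    # Apply previously measured tuning records so that relay.build selects the
    # best-performing schedule instead of a fallback.
    with autotvm.apply_history_best("microtvm_tuning.log"):
        with tvm.transform.PassContext(opt_level=3):
            lowered = relay.build(mod, target=target, params=params)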
Finally, when using the ``llvm`` backend, the process is similar except that LLVM bitcode is included
in the generated code (with an ``import_llvm`` pragma). LLVM bitcode provides a portable way to call
inline assembly. However, it may be more complex to call external C functions, and helper functions
are of course not easy to use from LLVM bitcode.
Executing Models
----------------
The TVM compiler traditionally outputs three pieces:
1. Model operator implementations, as discussed above;
2. A model execution graph, encoded as JSON; and
3. Simplified parameters.
To correctly execute the model, a Graph Executor needs to reconstruct the graph in memory, load the
parameters, and then invoke the operator implementations in the correct order.
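On a traditional operating system, the Python Graph Executor performs exactly this sequence; the
sketch below builds for the host ``llvm`` target purely for illustration (the input name and shape
are hypothetical, and ``mod``/``params`` come from the earlier examples). The on-device options
discussed next must reproduce the same reconstruct/load/invoke steps under bare-metal constraints.

.. code-block:: python

    import numpy as np
    import tvm
    from tvm import relay
    from tvm.contrib import graph_executor

    # Build for the host and run with the host-side Graph Executor, purely to
    # illustrate the reconstruct-graph / load-params / invoke-operators sequence.
    dev = tvm.cpu(0)
    host_lib = relay.build(mod, target="llvm", params=params)
    m = graph_executor.GraphModule(host_lib["default"](dev))  # reconstruct graph, load params
    m.set_input("input", np.zeros((1, 32, 32, 3), dtype="float32"))
    m.run()                                                   # invoke operators in order
    out = m.get_output(0).numpy()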
microTVM supports two ways to carry out these steps:
1. **Host-Driven**. The Graph Executor can run on the host and carry out execution by issuing
commands to the device using an RPC link with a UART-like transport.
2. **Standalone**. A C Graph Executor is available to be compiled on-device, but it is not
particularly memory efficient. This approach enables standalone execution without any attached host.
Host-Driven execution is designed for experimenting with models on-device and, like AutoTVM, uses the
RPC server to drive on-device computation. Standalone execution is intended for deployment.
Host-Driven Execution
^^^^^^^^^^^^^^^^^^^^^
In Host-Driven execution, the firmware binary consists of the following:
1. Generated operator implementations from TVM.
2. The TVM C runtime.
3. SoC-specific initialization.
4. The TVM RPC server.
5. (optional) Simplified Parameters.
This firmware image is flashed onto the device and a GraphExecutor instance is created on the host.
The GraphExecutor drives execution by sending RPC commands over a UART:
.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_host_driven.svg
:align: center
:width: 85%
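As a rough sketch of what this looks like from the Python side (a sketch only: the exact ``Session``
arguments and helper names such as ``create_local_graph_executor`` have changed across TVM versions,
and ``micro_binary``, ``flasher``, and the input name are assumptions rather than a fixed API):

.. code-block:: python

    import numpy as np
    import tvm
    import tvm.micro

    # micro_binary and flasher are assumed to come from the user-supplied build and
    # flash flow described later in this document; opening the Session starts the RPC link.
    with tvm.micro.Session(binary=micro_binary, flasher=flasher) as session:
        # Instantiate a GraphExecutor on the host, backed by the on-device system library.
        graph_mod = tvm.micro.create_local_graph_executor(
            lowered.get_graph_json(), session.get_system_lib(), session.device
        )
        graph_mod.set_input("input", np.zeros((1, 32, 32, 3), dtype="float32"))
        graph_mod.run()
        result = graph_mod.get_output(0).numpy()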
Standalone Execution
^^^^^^^^^^^^^^^^^^^^
In Standalone execution, the GraphExecutor is instantiated on device:
.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_standalone.svg
:align: center
:width: 85%
microTVM Firmware
------------------
We can now discuss how microTVM firmware should behave. An important task common to both model
execution strategies is configuring the SoC to match the way it performs in production. microTVM
considers this task project- and SoC-dependent. Whether for AutoTVM, host-driven model inference, or
standalone deployment, the user is expected to supply a project whose ``main()`` does the following:
1. Configure the SoC to match deployment performance.
2. Initialize the TVM C Runtime.
When configuring for host-driven inference or AutoTVM, the remaining tasks are well-defined:
3. Initialize a transport (i.e. a UART) for use with the TVM RPC server.
4. Launch the TVM RPC Server.
When configuring for standalone deployment, the firmware needs to:
1. Instantiate the system library by calling the ``runtime.SystemLib`` PackedFunc.
2. Instantiate a GraphExecutor passing the system library module.
3. Configure parameters and inputs as needed.
4. Run the model.
Parts of a microTVM Binary
--------------------------
To summarize, a microTVM firmware binary image must contain these parts:
1. Operator implementations, produced by TVM.
2. The TVM C runtime library, supplied by TVM as a static library.
3. SoC Initialization, supplied by the user.
For Host-driven model execution, firmware also needs:
4. The TVM RPC Server library.
For Standalone model execution, firmware also needs:
4. The TVM C GraphExecutor library, supplied by TVM as a static library.
5. The remaining compiler outputs (Simplified Parameters and Graph JSON).
The Automated Build Flow
------------------------
Once code generation is complete, ``tvm.relay.build`` returns a ``tvm.runtime.Module`` and the
user can save the generated C source or binary object to a ``.c`` or ``.o`` file. From this point, TVM
can theoretically step back and the user can compile and run the code separately.
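For example, the build artifacts might be written out as follows; the file names are hypothetical,
the location of the C source within the module hierarchy can vary by target, and
``tvm.micro.export_model_library_format`` is a helper available in more recent TVM releases.

.. code-block:: python

    import tvm
    import tvm.micro

    # Save the generated C source, graph JSON, and parameters individually
    # (depending on the target, the C source may live on lowered.get_lib() itself
    # or on one of its imported_modules).
    with open("model.c", "w") as f:
        f.write(lowered.get_lib().get_source())
    with open("graph.json", "w") as f:
        f.write(lowered.get_graph_json())
    with open("params.bin", "wb") as f:
        f.write(tvm.runtime.save_param_dict(lowered.get_params()))

    # Or bundle all of the compiler outputs into Model Library Format (newer releases).
    tvm.micro.export_model_library_format(lowered, "model.tar")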
However, for AutoTVM, TVM needs some automated flow to handle the following tasks:
1. Integrate operator implementations, the TVM C Runtime library, and the TVM RPC Server library into the
firmware project containing user-supplied SoC Initialization.
2. Build the resulting project.
3. Program the built firmware onto a (specific) attached device.
4. Identify the serial port or other transport to be used by TVM to drive remote execution.
At present, TVM expects the user to supply an implementation of the ``tvm.micro.Compiler``,
``tvm.micro.Flasher``, and ``tvm.micro.Transport`` interfaces. TVM then:
1. Builds each piece separately as a library.
2. Builds the libraries into a binary firmware image.
3. Programs the firmware image onto an attached device.
4. Opens a serial port to serve as the RPC server transport.
This design was chosen to reduce build times for microTVM (the common libraries need to be built
only once per candidate operator implementation). In practice, these projects are extremely small
and compile relatively quickly. Compared with the added complexity of this tighter build integration
with TVM, the performance gains are likely not worth it. A future design will consolidate the build
tasks into a single step and narrow the interface to provide a better integration.
Measuring operator performance
------------------------------
The TVM C runtime depends on user-supplied functions to measure time on-device. Users should implement
``TVMPlatformTimerStart`` and ``TVMPlatformTimerStop``. These functions should measure wall clock time, and there
are some pitfalls in implementing them:
1. If the CPU could halt or sleep during a computation (i.e. if it is being done on an accelerator),
a cycle counter should likely not be used as these tend to stop counting while the CPU is asleep.
2. The granularity of these functions can be relaxed as needed to extend the range of the timer
device. However, if granularity is too coarse, a sub-optimal schedule may be used.
3. An error should be raised if the timer overflows.
4. The timer should not interrupt computation unless absolutely necessary. Doing so may affect the
accuracy of the results.
5. Calibrating the output against a wall clock is ideal, but it will likely be too cumbersome. A
future PR could enable some characterization of the platform timer by, e.g., measuring the internal
oscillator against a reference such as an external crystal.
Future Work
===========
Ahead-of-Time Runtime
----------------------
A limitation of the Graph Executor is the amount of memory overhead required in parsing the JSON.
The current implementation contributes significantly to the dynamic memory usage of microTVM,
limiting its utility. An ahead-of-time runtime can avoid the need for any Graph JSON parsing and
improve inference speed by generating C code to call the generated operator implementations directly
rather than relying on a data-driven approach with the Graph Executor.
Memory Planning
----------------
The current memory planner attempts to limit the number of ``TVMBackendDeviceAlloc()`` calls
issued for intermediate tensors only. Because scratchpads can vary widely, and because the planner
coalesces memory allocations within 16x of each other, this strategy typically results in high
peak memory usage.
Heterogeneous Execution
-----------------------
Newer Cortex-M SoCs can contain multiple CPUs and onboard ML accelerators.
Autotuning Target
-----------------
As discussed previously,