..  Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

..    http://www.apache.org/licenses/LICENSE-2.0

..  Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.

**************************
microTVM Design Document
**************************

.. contents:: Table of Contents
    :depth: 3

Background
==========

TVM is a model deployment framework that has demonstrated good performance across a wide range of
models on traditional operating systems. Given TVM's layered approach to compilation, it is a
natural extension to target bare metal devices. While most of the compilation flow does not need to
change for a proof-of-concept implementation on such devices, the runtime cannot depend on:

* **Virtual Memory**, and by extension any system-provided ``malloc``. Additionally, bare metal
  devices typically have very limited memory (measured in KB). Because of this, libraries designed
  for such platforms typically need to be more judicious in using memory, and need to release
  memory when it is not in use.
* Traditional OS abstractions, such as **files**, **libraries**, and **kernel functions**. Some
  projects implement support for these, but they are by no means standard.
* Support for programming languages other than **C**.

Such changes require a different approach from the TVM C++ runtime typically used on traditional
operating systems.

Typical Use
===========

This section discusses our vision of the "typical" microTVM use case. Each component used to achieve
this typical use case is intended to be designed for flexibility, but this unifying vision serves to
motivate the inclusion of each part of the design.

.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_workflow.svg
   :align: center
   :width: 85%

The parts of this process are described below:

#. **Model Import**. The user imports an existing model or describes a new model to TVM, producing a
   *Relay module*.

#. **Model Transformations**. The user can apply transformations, such as quantization, to the
   model. After each transformation, the user should still have a Relay module.

#. **Compilation** (Scheduling and Code Generation). TVM lowers each Relay operator to Tensor IR by
   assigning it a schedule and schedule configuration. Then, code (C source or a compiled object)
   is generated for each operator.

#. **Integration**. The generated code is integrated along with the TVM C Runtime library into a
   user-supplied binary project. In some cases (such as when the project is standardized across
   multiple SoCs/development boards), this process is handled automatically.

#. **Deployment**. The project is built and the residual firmware binary is flashed onto the device.
   Model inference is driven either from the host by TVM using an on-device RPC server, or on the
   device using the on-device Graph Executor.

Design Goals
============

microTVM aims to achieve these design goals:

1. **Portable Code**. microTVM can translate any Relay model into C code that can compile with only
   a C standard library.
2. **Minimal Overhead**. microTVM generates target-specific, highly optimized code, removing as
   much overhead from the runtime as possible.
3. **Accessible Code**. microTVM treats C source code as a first-class output mechanism so that
   it is easier for a firmware engineer to understand and tweak.

Overview
========

microTVM requires changes at all levels of the TVM compiler stack. The following sub-sections enumerate
these changes at a high level, and follow-on sections discuss the specifics in more detail.

Modeling Target Platforms
-------------------------

TVM's search-based optimization approach allows it to largely avoid system-level modeling of targets
in favor of experimental results. However, some modeling is necessary to ensure TVM is comparing
apples-to-apples search results, and to avoid wasting time during the search by attempting to
compile invalid code for a target.

microTVM models these parts of the target:

* The CPU used, through the ``-mcpu`` and ``-march`` target flags.
* The presence or absence of accelerators, through the device components of the target (currently
  only the absence of accelerators can be expressed, but this mechanism should extend well).

microTVM aims to model these parts of the target in the future:

* Memory, modeled as a set of disjoint memory spaces, each with a label, size, and prefetch/flush
  behavior. Some memory may be shared with accelerators.
* Target runtime configuration (i.e. clock tree configuration, clock speed, etc.). This is intended
  only to contribute to the AutoTVM schedule key and not for any other use.

At this time, TVM does not intend to model:

* Size, type, or relationship of caches, with the exception of prefetching or cache flushing.

TVM Targets for microTVM
------------------------

A central data structure in the compilation process is the ``tvm::target::Target`` class. TVM uses
Target to decide which TIR schedules to enable and how to configure the code generator. The Target
class should also uniquely identify the generated code for a particular operator, as autotuning
logs use it to rank measured performance (but see Future Work).

Targets are currently represented as strings structured similarly to command-line arguments. An
example target is shown below:

``c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx -runtime=c -system-lib=1``

The parts relevant to microTVM are:

* Code generator (``llvm`` or ``c``).
* ``-mcpu=cortex-m7``: used by TOPI to enable Cortex-M schedules and, when the C source code
  generator is selected, included in the output as a comment to help identify the code and
  configure the downstream C compiler.
* ``-link-params``: include parameters as global constants to load from flash.
* ``-runtime=c``: build glue code to allow operators to work with the C runtime.
* ``-system-lib=1``: emit a system library (i.e. one which can be loaded by calling the PackedFunc
  ``runtime.SystemLib``).

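To make the structure of such strings concrete, here is a minimal sketch in plain Python (not TVM's
actual target parser) that splits the example target above into its code generator kind and its
option flags:

```python
# Toy parser for a TVM-style target string; bare flags (no "=") become True.
import shlex

def parse_target(target_str):
    """Split a target string into (kind, options dict)."""
    tokens = shlex.split(target_str)
    kind = tokens[0]                       # code generator, e.g. "c" or "llvm"
    options = {}
    for tok in tokens[1:]:
        key, _, value = tok.lstrip("-").partition("=")
        options[key] = value if value else True
    return kind, options

kind, opts = parse_target(
    "c -keys=arm_cpu -mcpu=cortex-m7 -link-params -model=stm32f746xx "
    "-runtime=c -system-lib=1")
print(kind, opts["mcpu"], opts["runtime"])  # prints: c cortex-m7 c
```

This mirrors why the Target string can double as an autotuning log key: two runs with the same kind
and options parse to the same structure.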
Writing Schedules for microTVM
------------------------------

For operations scheduled on the CPU, microTVM initially plans to make use of specialized
instructions and extern (i.e. hand-optimized) functions to achieve good performance. In TVM, this
approach is generally accomplished through tensorization, in which TVM breaks a computation into
small pieces, and a TIR extern function accelerates each small piece.

TVM currently accommodates both approaches using ``tir.call_extern``. First, a pragma is attached to
the schedule defining the extern function in portable C:

``sched[output].pragma(n, "import_c", "void call_asm(int32_t* a, int32_t* b) { /* ... */ }")``

Next, ``tensorize`` is used to split the computation:

``sched[output].tensorize(owi, gemm)``

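The effect of ``tensorize`` can be illustrated in plain Python (a conceptual sketch, not the TVM
API): the outer loop stays in generated code, while each fixed-size tile is handed to a stand-in
for a hand-optimized kernel:

```python
# Conceptual illustration of tensorization: split a dot product into tiles,
# each processed by a "hand-optimized" inner kernel.
def gemv_tile(acc, a_tile, b_tile):
    """Stand-in for a hand-written inner kernel (e.g. SIMD assembly)."""
    for x, y in zip(a_tile, b_tile):
        acc += x * y
    return acc

def dot(a, b, tile=4):
    assert len(a) == len(b) and len(a) % tile == 0
    acc = 0
    for i in range(0, len(a), tile):                    # outer loop: generated code
        acc = gemv_tile(acc, a[i:i + tile], b[i:i + tile])  # tile: extern kernel
    return acc

print(dot([1, 2, 3, 4, 5, 6, 7, 8], [1, 1, 1, 1, 2, 2, 2, 2]))  # prints 62
```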
There are a couple of caveats to this approach, all of which could be resolved by linking generated
code against external libraries:

* Inline assembly is compiler-specific. While Clang and GCC have standardized on one syntax, this
  may not be portable to other compilers. SDKs solve this by conditionally including a header file
  depending on the compiler being used. However, taking this approach means that the generated code
  needs additional compiler flags (i.e. ``-Isystempath/to/header``).
* It may be helpful to reference helper functions from the generated code (e.g. to inline common
  sequences of hand-optimized assembly).
* Finally, the extern function invoked may live wholly in an external library. If those
  functions can be wholly inlined, this caveat is the same as the previous one. If not, additional
  C code needs to be compiled and linked against the operator.

At present, microTVM presumes that all eligible schedules can be compiled. This means that the
user-supplied project (see next section) must include all libraries that are used by the generated
code. When not using autotuning, TVM randomly chooses a fallback schedule, so all libraries would
need to be supported. When using autotuning, TVM selects the best-performing schedule, so only that
library is needed. There isn't currently a way to force TVM to pick a particular schedule outside of
autotuning logs, but that would be a good addition.

Finally, when using the ``llvm`` backend, the process is similar except that LLVM bitcode is included
in the generated code (with an ``import_llvm`` pragma). LLVM bitcode provides a portable way to call
inline assembly. However, it may be more complex to call external C functions, and helper functions
are of course not easy to use from LLVM bitcode.

Executing Models
----------------

The TVM compiler traditionally outputs three pieces:

1. Model operator implementations, as discussed above;
2. A model execution graph, encoded as JSON; and
3. Simplified parameters.

To correctly execute the model, a Graph Executor needs to reconstruct the graph in memory, load the
parameters, and then invoke the operator implementations in the correct order.

microTVM supports two ways to do this:

1. **Host-Driven**. The Graph Executor runs on the host and carries out execution by issuing
   commands to the device over an RPC link with a UART-like transport.
2. **Standalone**. A C Graph Executor is available to be compiled on-device, though it is not
   particularly memory efficient. This mode enables standalone execution without any attached host.

Host-Driven execution is designed for experimenting with models on-device and, like AutoTVM, uses
the RPC server to drive computation on-device. Standalone execution is intended for deployment.

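The data-driven nature of graph execution can be sketched in a few lines of plain Python (a toy
illustration, not TVM's GraphExecutor): a JSON graph is reconstructed in memory, then nodes are
visited in topological order and their operator implementations invoked:

```python
# Toy data-driven executor: the graph JSON, not compiled code, decides the
# order in which operator implementations run.
import json

graph = json.loads("""
{"nodes": [
  {"op": "input",  "name": "x"},
  {"op": "add1",   "name": "y", "inputs": ["x"]},
  {"op": "double", "name": "z", "inputs": ["y"]}
]}
""")

# Lookup table standing in for compiled operator implementations.
ops = {"add1": lambda v: v + 1, "double": lambda v: v * 2}

def run(graph, inputs):
    values = dict(inputs)
    for node in graph["nodes"]:          # nodes are stored in topological order
        if node["op"] == "input":
            continue
        args = [values[name] for name in node["inputs"]]
        values[node["name"]] = ops[node["op"]](*args)
    return values

print(run(graph, {"x": 3})["z"])  # (3 + 1) * 2 -> prints 8
```

The memory needed to hold the parsed ``nodes`` list is exactly the overhead the Ahead-of-Time
runtime (see Future Work) aims to eliminate.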
Host-Driven Execution
^^^^^^^^^^^^^^^^^^^^^

In Host-Driven execution, the firmware binary contains the following:

1. Generated operator implementations from TVM.
2. The TVM C runtime.
3. SoC-specific initialization.
4. The TVM RPC server.
5. (optional) Simplified Parameters.

This firmware image is flashed onto the device and a GraphExecutor instance is created on the host.
The GraphExecutor drives execution by sending RPC commands over a UART:

.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_host_driven.svg
   :align: center
   :width: 85%

Standalone Execution
^^^^^^^^^^^^^^^^^^^^

In Standalone execution, the GraphExecutor is instantiated on-device:

.. figure:: https://raw.githubusercontent.com/tvmai/web-data/main/images/dev/microtvm_standalone.svg
   :align: center
   :width: 85%

microTVM Firmware
-----------------

We can now discuss how microTVM firmware should behave. An important task common to both model
execution strategies is configuring the SoC to match the way it performs in production. microTVM
considers this task project- and SoC-dependent. Whether for AutoTVM, host-driven model inference, or
standalone deployment, the user is expected to supply a project whose ``main()`` does the following:

1. Configure the SoC to match deployment performance.
2. Initialize the TVM C Runtime.

When configuring for host-driven inference or AutoTVM, the remaining tasks are well-defined:

3. Initialize a transport (i.e. a UART) for use with the TVM RPC server.
4. Launch the TVM RPC Server.

When configuring for standalone deployment, the firmware instead needs to:

3. Instantiate the system library by calling the ``runtime.SystemLib`` PackedFunc.
4. Instantiate a GraphExecutor, passing the system library module.
5. Configure parameters and inputs as needed.
6. Run the model.

Parts of a microTVM Binary
--------------------------

To summarize, a microTVM firmware binary image must contain these parts:

1. Operator implementations, produced by TVM.
2. The TVM C runtime library, supplied by TVM as a static library.
3. SoC Initialization, supplied by the user.

For Host-Driven model execution, firmware also needs:

4. The TVM RPC Server library.

For Standalone model execution, firmware also needs:

4. The TVM C GraphExecutor library, supplied by TVM as a static library.
5. The remaining compiler outputs (Simplified Parameters and Graph JSON).

The Automated Build Flow
------------------------

Once code generation is complete, ``tvm.relay.build`` returns a ``tvm.runtime.Module`` and the
user can save the generated C source or binary object to a ``.c`` or ``.o`` file. From this point, TVM
can theoretically step back and the user can compile and run the code separately.

However, for AutoTVM, TVM needs an automated flow to handle the following tasks:

1. Integrate operator implementations, the TVM C Runtime library, and the TVM RPC Server library into the
   firmware project containing user-supplied SoC Initialization.
2. Build the resulting project.
3. Program the built firmware onto a (specific) attached device.
4. Identify the serial port or other transport to be used by TVM to drive remote execution.

At present, TVM expects the user to supply an implementation of the ``tvm.micro.Compiler``,
``tvm.micro.Flasher``, and ``tvm.micro.Transport`` interfaces. TVM then:

1. Builds each piece separately as a library.
2. Builds the libraries into a binary firmware image.
3. Programs the firmware image onto an attached device.
4. Opens a serial port to serve as the RPC server transport.

This design was chosen to reduce build times for microTVM (the common libraries need to be built
only once per candidate operator implementation). In practice, these projects are extremely small
and compile relatively quickly. Compared with the added complexity of this tighter build integration
with TVM, the performance gains are likely not worth it. A future design will consolidate the build
tasks into a single step and narrow the interface to provide better integration.

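The shape of the three user-supplied interfaces can be sketched as Python abstract base classes.
The class names come from the text above, but the method names and signatures here are illustrative
only; consult the actual ``tvm.micro`` API for the exact contracts:

```python
# Illustrative sketch of the user-supplied build/flash/transport interfaces.
import abc

class Compiler(abc.ABC):
    @abc.abstractmethod
    def library(self, output, sources, options=None):
        """Build one component (e.g. the C runtime) into a static library."""

    @abc.abstractmethod
    def binary(self, output, libraries, options=None):
        """Link the built libraries into a firmware image."""

class Flasher(abc.ABC):
    @abc.abstractmethod
    def flash(self, micro_binary):
        """Program the image onto an attached board; return a Transport."""

class Transport(abc.ABC):
    @abc.abstractmethod
    def read(self, n, timeout_sec):
        """Read up to n bytes from the device (e.g. over a serial port)."""

    @abc.abstractmethod
    def write(self, data, timeout_sec):
        """Write bytes to the device."""
```

Splitting the flow along these three seams is what lets TVM reuse the same RPC-driven AutoTVM
machinery across otherwise incompatible vendor toolchains.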
Measuring operator performance
------------------------------

The TVM C runtime depends on user-supplied functions to measure time on-device. Users should implement
``TVMPlatformTimerStart`` and ``TVMPlatformTimerStop``. These functions should measure wall clock time, so there
are some pitfalls in implementing them:

1. If the CPU could halt or sleep during a computation (e.g. if it is being done on an accelerator),
   a cycle counter should likely not be used, as these tend to stop counting while the CPU is asleep.
2. The granularity of these functions can be relaxed as needed to extend the range of the timer
   device. However, if granularity is too coarse, a sub-optimal schedule may be used.
3. An error should be raised if the timer overflows.
4. The timer should not interrupt computation unless absolutely necessary. Doing so may affect the
   accuracy of the results.
5. Calibrating the output against a wall clock is ideal, but it will likely be too cumbersome. A
   future PR could enable some characterization of the platform timer by, e.g., measuring the internal
   oscillator against a reference such as an external crystal.

Future Work
===========

Ahead-of-Time Runtime
---------------------

A limitation of the Graph Executor is the amount of memory overhead required in parsing the JSON.
The current implementation contributes significantly to the dynamic memory usage of microTVM,
limiting its utility. An ahead-of-time runtime can avoid the need for any Graph JSON parsing and
improve inference speed by generating C code to call the generated operator implementations directly
rather than relying on a data-driven approach with the Graph Executor.

Memory Planning
---------------

The current memory planner attempts to limit the number of ``TVMBackendDeviceAlloc()`` calls
issued for intermediate tensors only. Because scratchpads can vary widely, and because the planner
coalesces memory allocations within 16x of each other, this strategy typically results in high
peak memory usage.

Heterogeneous Execution
-----------------------

Newer Cortex-M SoCs can contain multiple CPUs and onboard ML accelerators.

Autotuning Target
-----------------

As discussed previously,