<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>TVM</title>
<link href="https://tvm.apache.org" rel="self"/>
<link href="https://tvm.apache.org"/>
<updated>2022-01-10T13:22:42-05:00</updated>
<id>https://tvm.apache.org</id>
<author>
<name></name>
<email></email>
</author>
<entry>
<title>Apache TVM Unity: a vision for the ML software &amp; hardware ecosystem in 2022</title>
<link href="https://tvm.apache.org/2021/12/15/tvm-unity"/>
<updated>2021-12-15T00:00:00-05:00</updated>
<id>https://tvm.apache.org/2021/12/15/tvm-unity</id>
<content type="html">&lt;p&gt;Apache TVM Unity is a roadmap for the TVM ecosystem in 2022. We see a broader shift coming in the way that machine learning system stacks optimize for flexibility and agility in the face of a rapidly changing hardware landscape. TVM will evolve to break down the boundaries that constrain the ways current ML systems adapt to rapid changes in ML models and the accelerators that implement them.&lt;/p&gt;
&lt;h2 id=&quot;boundaries-in-the-modern-ml-system-stack&quot;&gt;Boundaries in the Modern ML System Stack&lt;/h2&gt;
&lt;p&gt;&lt;img src=&quot;/images/tvm-unity/image4.png&quot; alt=&quot;image&quot; style=&quot;width: 40%; margin: auto; display: block;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The system stack for modern machine learning consists of four kinds of abstractions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;The &lt;em&gt;computational graph&lt;/em&gt; abstraction encodes the flow of data between coarse-grained tensor operators. Computational graphs are the high-level abstraction users interact with in &lt;a href=&quot;https://www.tensorflow.org/&quot;&gt;TensorFlow&lt;/a&gt;, &lt;a href=&quot;https://mxnet.apache.org/&quot;&gt;MXNet&lt;/a&gt;, and &lt;a href=&quot;https://pytorch.org/&quot;&gt;PyTorch&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Tensor programs&lt;/em&gt; implement the code for the operators in the computational graph. Deep learning compilers generate the low-level C++ or CUDA code for computations like convolutions or matrix multiplications.&lt;/li&gt;
&lt;li&gt;Similarly, &lt;em&gt;libraries and runtimes&lt;/em&gt; include pre-written code to execute and orchestrate tensor operations. BLAS packages and libraries like cuDNN provide extensively tuned operator implementations for specific hardware targets.&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Hardware primitives&lt;/em&gt; are at the bottom of the stack. Here, low-level assembly languages and hardware accelerator interfaces expose the raw capabilities of the machine.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;There are &lt;em&gt;vertical&lt;/em&gt; boundaries between the abstraction levels that prohibit cross-layer interactions and feedback between the levels. There is also a &lt;em&gt;horizontal&lt;/em&gt; boundary between two opposing ways that software stacks can treat the central tensor computation level. The horizontal boundary divides &lt;em&gt;library-based&lt;/em&gt; and &lt;em&gt;compilation-based&lt;/em&gt; approaches to tensor computation.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/tvm-unity/image1.png&quot; alt=&quot;image&quot; style=&quot;width: 70%; margin: auto; display: block;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Library-based frameworks rely on collections of pre-made, carefully tuned operator implementations as their computational workhorse. Compilation-based frameworks instead generate their own custom tensor operation code from scratch. Modern software stacks typically use one style or the other, but they don’t combine them: most deep learning frameworks are library-based, while most deep learning compilers cannot incorporate libraries and runtimes.&lt;/p&gt;
&lt;p&gt;In the current landscape of ML systems, the boundaries between these layers tend to be strict. Neither approach is strictly better than the other; each has trade-offs. Library-based stacks excel on standard styles of ML models because they benefit from years of engineering investment in common operators. On the other side, the flexibility and automation in compilation-based frameworks can be better for emerging models that require new operators.&lt;/p&gt;
&lt;p&gt;Vertical boundaries exist in both styles of software stack. AI applications start at the top of the stack and march through the layers from top to bottom. Frameworks choose data layout and operator fusion strategies at the graph level; then the tensor computations carry out the operators selected in the computational graph; and these operators map onto a fixed set of hardware primitives. It’s a one-shot, unidirectional workflow: performance constraints at the level of tensor programs, for example, cannot feed back to influence the data layout at the computational graph level. And incorporating custom hardware typically means manually propagating new features through all three layers.&lt;/p&gt;
&lt;p&gt;Both vertical and horizontal boundaries are slowing down the pace of innovation in machine learning. New hardware accelerators are emerging with new levels of capability and performance, but harnessing them will require fluid collaboration between ML scientists, ML engineers, and hardware vendors, which these boundaries prevent. To cope with the rapid pace of change in ML systems, frameworks need to support &lt;strong&gt;incremental&lt;/strong&gt; evolution: incorporating new capabilities should require effort proportional to the change, not wholesale re-engineering at each level.&lt;/p&gt;
&lt;h2 id=&quot;tvm-unity&quot;&gt;TVM Unity&lt;/h2&gt;
&lt;p&gt;The TVM Unity vision is about breaking down these barriers. The goal is to enable cross-layer interactions and automate their optimization. It is not to collapse the abstraction layers into a monolith: there is no “silver bullet” representation for AI programs that simultaneously enables optimization at every level. Instead, TVM Unity will build interfaces for the abstractions to interact and exchange information.&lt;/p&gt;
&lt;p&gt;Removing the strict barriers between the levels in the system stack will enable new kinds of optimization that work jointly across the layers. A unified view of the entire system will let TVM automatically co-optimize decisions in the computation graph, the tensor operators, and the hardware mapping to search for the best possible implementation of an AI application. At the same time, TVM Unity will also serve as a communication substrate for interactions between ML scientists, ML engineers, and hardware engineers. This collaboration will be crucial for adapting to the rapid changes that are coming in the next phase of hardware acceleration for ML.&lt;/p&gt;
&lt;h3 id=&quot;unifying-abstractions&quot;&gt;Unifying Abstractions&lt;/h3&gt;
&lt;p&gt;&lt;img src=&quot;/images/tvm-unity/image2.png&quot; alt=&quot;image&quot; style=&quot;width: 70%; margin: auto; display: block;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;TVM Unity will focus on letting AI applications fluidly cross the boundaries between operator graphs, tensor programs, and hardware primitives. In TVM, a single Python program can define a core tensor operation, incorporate a custom hardware primitive, and invoke the operation from a larger operator graph.
This example shows all of these capabilities:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tvm.script&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tvm.script&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tir&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relax&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;script&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir_module&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;nc&quot;&gt;MyIRModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Define a TIR based operation.
&lt;/span&gt;    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prim_func&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tir_mm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
               &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Buffer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grid&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;block&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;body&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
                &lt;span class=&quot;n&quot;&gt;vi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vk&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;axis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;remap&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;SSR&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
                &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;T&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
                    &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;
                &lt;span class=&quot;c1&quot;&gt;# Can be mapped on to HW intrinsics.
&lt;/span&gt;                &lt;span class=&quot;n&quot;&gt;Y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;X&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vi&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;W&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;vk&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;vj&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;function&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;relax_func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dataflow&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
            &lt;span class=&quot;c1&quot;&gt;# Invoke the TIR code.
&lt;/span&gt;            &lt;span class=&quot;n&quot;&gt;lv0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;call_dps&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;((&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tir_mm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;w&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;lv1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;flatten&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lv0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;gv0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;m&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;exp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lv1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gv0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;c1&quot;&gt;# Invoke external update rule.
&lt;/span&gt;        &lt;span class=&quot;n&quot;&gt;R&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;call_packed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;custom_inplace_update&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gv0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gv0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This code has both a tensor program (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tir_mm&lt;/code&gt;) and a computational graph that includes it (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relax_func&lt;/code&gt;). The high-level data flow can directly invoke the low-level tensor manipulation to build up a larger computation. The TVM runtime unifies the operator graph and compiler-based tensor computation to optimize the entire program. This code also uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;call_packed&lt;/code&gt; to invoke a pre-baked operator, showing how TVM can smoothly integrate library-based operators with the custom computation.&lt;/p&gt;
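&lt;p&gt;To make the semantics of the listing concrete, here is a minimal plain-Python sketch of what &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tir_mm&lt;/code&gt; and the dataflow part of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relax_func&lt;/code&gt; compute. It is not TVM code: the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tir_mm_reference&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relax_func_reference&lt;/code&gt; are hypothetical, nested lists stand in for tensors, and the external &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custom_inplace_update&lt;/code&gt; call is left out because its behavior is not specified here.&lt;/p&gt;

```python
import math


def tir_mm_reference(X, W):
    """Illustrative mirror of the tir_mm loop nest using nested lists."""
    n, d, m = len(X), len(W), len(W[0])
    Y = [[0.0] * m for _ in range(n)]
    for i in range(n):              # spatial axis (the first "S" in "SSR")
        for j in range(m):          # spatial axis (the second "S")
            Y[i][j] = 0.0           # plays the role of T.init()
            for k in range(d):      # reduction axis (the "R")
                Y[i][j] += X[i][k] * W[k][j]
    return Y


def relax_func_reference(x, w):
    """Illustrative mirror of the dataflow block: matmul, flatten, exp."""
    lv0 = tir_mm_reference(x, w)                   # shape (n, m)
    lv1 = [value for row in lv0 for value in row]  # shape (n * m,)
    gv0 = [math.exp(value) for value in lv1]       # shape (n * m,)
    return gv0


x = [[1.0, 2.0], [0.0, 1.0]]
w = [[1.0, 0.0], [0.0, 1.0]]
result = relax_func_reference(x, w)
```

&lt;p&gt;In this sketch the shapes are concrete Python lengths, whereas in the TVM listing the dimensions n, m, and d are symbolic and are carried across the graph and tensor-program levels.&lt;/p&gt;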
&lt;p&gt;Additionally, TensorIR opens doors to exploit hardware primitives through tensorization. Tensorization transforms loop-level programs to implementations that map onto the primitives that a particular hardware target declares.&lt;/p&gt;
&lt;p&gt;The key point to highlight here is &lt;strong&gt;cross-layer interactions&lt;/strong&gt;. Our particular example shows interactions between: (1) the computational graph and tensor programs; (2) the computational graph and runtime libraries; and (3) tensor programs and hardware primitives, through ongoing automatic tensorization developments in TensorIR. These cross-layer interactions open doors for making &lt;strong&gt;incremental optimizations&lt;/strong&gt; at the boundary. For example, we can build a customized pass that lowers part of the subgraph to a set of runtime libraries and then passes the rest on to the remainder of the pipeline.&lt;/p&gt;
&lt;p&gt;In addition to the unification of abstraction layers, we are also working on unifying the shape representation, to enable &lt;strong&gt;first-class symbolic shape support&lt;/strong&gt; across the stack. In our particular example, the symbolic shape dimensions (n, m) can flow across the abstractions and enable advanced optimizations for dynamic workloads. These additional capabilities will open doors for both training and inference workload optimizations.&lt;/p&gt;
&lt;h3 id=&quot;unifying-perspectives&quot;&gt;Unifying Perspectives&lt;/h3&gt;
&lt;p&gt;Better ML systems require collaboration between ML scientists, ML engineers, and hardware engineers. The coming era of diverse specialized ML hardware will require coordinated effort from teams that include all three groups. By building rich, bidirectional interfaces between the layers in the system stack, TVM Unity aims to be the medium through which this collaboration and iteration happens.&lt;/p&gt;
&lt;p&gt;Abstractions in TVM can catalyze the lifecycle of an improvement to an AI application. At the highest level, an ML scientist can specify the operator they need to construct the next generation of a model. ML engineers can work at the tensor computation level to make this new operation efficient. Finally, these tensor computations can rely on hardware primitives written by hardware engineers. The work at each level will interact through Python APIs within the TVM ecosystem. The ability to work together within TVM, rather than invasively modifying a framework with each new feature, will be the key to fast iteration in the face of rapidly evolving hardware.&lt;/p&gt;
&lt;h3 id=&quot;automation&quot;&gt;Automation&lt;/h3&gt;
&lt;p&gt;A unified ML system creates a new, larger search space than a system stack with strict boundaries. Decisions within tensor computations can influence the structure of the operator graph, and new hardware primitives can drastically change the optimal mappings at every other layer.&lt;/p&gt;
&lt;p&gt;TVM Unity will expose all these cross-layer interactions for automated optimization. Finding the best implementation for a given application will require learning-driven optimization: using ML to optimize ML by exploring the expanded joint search space and minimizing the computational cost.&lt;/p&gt;
&lt;p&gt;In addition to that, we also want to leverage domain experts’ help when possible, and create mechanisms to effectively incorporate domain information to help guide the automatic optimizations.&lt;/p&gt;
&lt;h2 id=&quot;new-capabilities-with-unity&quot;&gt;New Capabilities with Unity&lt;/h2&gt;
&lt;p&gt;The Unity vision guides the technical roadmap for TVM’s evolution over the next year. The unified approach will position TVM to offer new forms of automation and ecosystem integration that are not possible with today’s system stacks.&lt;/p&gt;
&lt;p&gt;With Unity, TVM will unify library-based computation with compiler-based automation. AI applications will be able to combine the world’s best known code for common operators with automatically optimized code for computations that don’t map neatly onto any existing operator. Developers will be able to smoothly transition between both strategies without a steep “performance cliff” when switching from built-in to generated code. Teams will be able to iterate rapidly with compiled code for new model designs and then, as models mature and stabilize, fluidly incorporate optimized operator libraries to maximize performance. By erasing the boundary between operator-based and compiler-based stacks, TVM will enable automatic exploration of the trade-off space between the two extremes.&lt;/p&gt;
&lt;p&gt;TVM also aims to serve as a bridge to unify the broader ML and hardware ecosystems. In the ML ecosystem, TVM offers a minimal runtime that does not constrain teams’ choice of frameworks. TVM models will be easy to embed into other frameworks and runtimes as subgraphs for both training and inference. Through exchange formats like &lt;a href=&quot;https://onnx.ai/&quot;&gt;ONNX&lt;/a&gt; and &lt;a href=&quot;https://pytorch.org/docs/stable/jit.html&quot;&gt;TorchScript&lt;/a&gt;, TVM models can fluidly integrate into larger applications built on any infrastructure. In the hardware ecosystem, TVM is already the best way for accelerator designers to integrate with ML applications. With TVM Unity, hardware vendors will easily onboard into TVM via a simple set of operators and then incrementally transition to compilation-based integration for better flexibility. This way, new hardware capabilities can get started improving AI applications without reinventing the whole system stack.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/tvm-unity/image3.png&quot; alt=&quot;image&quot; style=&quot;width: 50%; margin: auto; display: block;&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Beyond TVM alone, the same forces that are driving TVM Unity exist across the theory and practice of modern ML. Rapid changes to models, emerging alternative hardware, and aging abstraction boundaries all point toward the need for an integrated approach. We expect TVM to lead the way into the next great industry-wide shift in ML systems.&lt;/p&gt;
&lt;p&gt;For more details about our vision for TVM, check out the talks and discussion at &lt;a href=&quot;https://www.tvmcon.org&quot;&gt;TVMCon 2021&lt;/a&gt;.&lt;/p&gt;
</content>
</entry>
<entry>
<title>Introducing TVM Auto-scheduler (a.k.a. Ansor)</title>
<link href="https://tvm.apache.org/2021/03/03/intro-auto-scheduler"/>
<updated>2021-03-03T00:00:00-05:00</updated>
<id>https://tvm.apache.org/2021/03/03/intro-auto-scheduler</id>
<content type="html">&lt;p&gt;Optimizing the execution speed of deep neural networks is extremely hard with the growing
model size, operator diversity, and hardware heterogeneity.
From a computational perspective, deep neural networks are just layers and layers of tensor computations.
These tensor computations, such as matmul and conv2d, can be easily described by mathematical expressions.
However, providing high-performance implementations for them on modern hardware can be very challenging.
We have to apply various low-level optimizations and utilize special hardware intrinsics to achieve high performance.
It takes huge engineering effort to build linear algebra and neural network acceleration libraries like cuBLAS, cuDNN, oneMKL, and oneDNN.&lt;/p&gt;
&lt;p&gt;Our lives would be much easier if we could just write mathematical expressions and have something
magically turn them into efficient code implementations.
Three years ago, the deep learning compiler TVM and its search module AutoTVM were built as the first step towards this goal.
AutoTVM employs a template-based search algorithm to find efficient implementations for a given tensor computation.
However, because the approach is template-based, it still requires domain experts to implement a non-trivial manual template
for every operator on every platform.
Today, there are more than 15k lines of code for these templates in the TVM code repository.
Besides being very hard to develop, these templates often have inefficient and limited search spaces,
making them unable to achieve optimal performance.&lt;/p&gt;
&lt;p&gt;To address the limitations of AutoTVM, we started the Ansor project, which aims to build a fully automated auto-scheduler for
generating code for tensor computations.
The Ansor auto-scheduler takes only tensor expressions as input and generates high-performance code without manual templates.
We made innovations in the search space construction and search algorithm.
As a result, the auto-scheduler can achieve better performance with less search time in a more automated way.&lt;/p&gt;
&lt;p&gt;The Ansor auto-scheduler is now integrated into Apache TVM as the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.auto_scheduler&lt;/code&gt; package.
This is a joint effort by collaborators from UC Berkeley, Alibaba, AWS and OctoML.
Detailed tutorials are available for Intel CPUs, ARM CPUs, NVIDIA GPUs, and Mali GPUs on the TVM website [1].
In this blog post, we will give a high-level introduction and show some benchmark results.&lt;/p&gt;
&lt;h1 id=&quot;system-overview&quot;&gt;System Overview&lt;/h1&gt;
&lt;h2 id=&quot;autotvm-vs-auto-scheduler&quot;&gt;AutoTVM vs Auto-scheduler&lt;/h2&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/workflow.png&quot; alt=&quot;image&quot; width=&quot;75%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Table 1. Workflow Comparison &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Table 1 compares the workflow for generating code for an operator in AutoTVM and auto-scheduler.
In AutoTVM, the developer has to go through three steps.
In step 1, the developer has to write the compute definition in TVM’s tensor expression language.
This part is relatively easy because TVM’s tensor expression language looks just like math expressions.
In step 2, the developer has to write a schedule template, which typically consists of 20-100 lines of tricky DSL code.
This part is difficult because it requires domain expertise in both the target hardware architecture and the operator semantics.
The last step, step 3, is automated by a search algorithm.&lt;/p&gt;
&lt;p&gt;In auto-scheduler, we eliminate the most difficult step 2 by automatic search space construction and accelerate step 3 with a better search algorithm.
By constructing the search space automatically, we not only eliminate huge manual effort
but also enable the exploration of many more optimization combinations.
This automation does not come for free, because we still need to design rules to generate the search space.
However, these rules are very general. They are based on static analysis of the tensor expressions.
We only need to design a few general rules once and can apply them to almost all tensor computations in deep learning.&lt;/p&gt;
&lt;h2 id=&quot;search-process&quot;&gt;Search Process&lt;/h2&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_overview.png&quot; alt=&quot;image&quot; width=&quot;40%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Figure 1. Search Process Overview &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Figure 1 shows the search process of auto-scheduler when optimizing a whole neural network.
The system takes deep learning models as input.
It then partitions the big model into small subgraphs with Relay’s operator fusion pass.
A task scheduler allocates the time budget across the many subgraphs.
At each iteration, it picks a subgraph that has the most potential to increase the end-to-end performance.
For this subgraph, we analyze its tensor expression and generate several sketches for it.
Then we run evolutionary search with a learned cost model to get a batch of optimized programs.
The optimized programs are sent to actual hardware for measurements.
When the measurements are finished, the profiling results are used as feedback to update all components of the system.
This process is repeated iteratively until the optimization converges or we run out of time budget.
More technical details can be found in our paper [3] and our code.&lt;/p&gt;
&lt;p&gt;It is worth noting that since the auto-scheduler generates schedules from scratch,
it reuses the existing computation definitions in TOPI but not schedule templates.&lt;/p&gt;
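&lt;p&gt;The loop described above can be sketched in a few lines of plain Python. This is a toy under loud assumptions: the scoring rule in &lt;code&gt;pick_task&lt;/code&gt; and the fixed improvement rate in &lt;code&gt;tune_one_round&lt;/code&gt; are hypothetical stand-ins for the real task scheduler, sketch generation, evolutionary search, and hardware measurement.&lt;/p&gt;

```python
def pick_task(tasks):
    """Pick the task whose last round gave the largest weighted latency drop."""
    return max(tasks, key=lambda t: t["weight"] * t["last_gain"])


def tune_one_round(task):
    """Stand-in for sketch generation, evolutionary search and measurement.

    Each round simply removes a fixed fraction of the remaining latency.
    """
    new_latency = task["latency"] * (1.0 - task["rate"])
    task["last_gain"] = task["latency"] - new_latency
    task["latency"] = new_latency


def end_to_end(tasks):
    """Weighted sum of subgraph latencies, a proxy for whole-model latency."""
    return sum(t["weight"] * t["latency"] for t in tasks)


# weight = how many times the subgraph appears in the model (made-up numbers).
tasks = [
    {"name": "conv2d", "weight": 4, "latency": 10.0, "rate": 0.5,
     "last_gain": float("inf")},
    {"name": "dense", "weight": 1, "latency": 2.0, "rate": 0.5,
     "last_gain": float("inf")},
]

before = end_to_end(tasks)
history = []
for _ in range(4):  # the time budget, counted in tuning rounds
    task = pick_task(tasks)
    tune_one_round(task)
    history.append(task["name"])
after = end_to_end(tasks)

print(history)
print(before, after)
```

&lt;p&gt;In this toy run the frequently used subgraph receives three of the four rounds, which illustrates the idea of spending the budget where it helps end-to-end latency most.&lt;/p&gt;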
&lt;h1 id=&quot;benchmark-results&quot;&gt;Benchmark Results&lt;/h1&gt;
&lt;p&gt;In this section, we benchmark the performance of AutoTVM and auto-scheduler.
The CPU benchmark is done on an AWS c5.9xlarge, which is equipped with an 18-core Intel Skylake 8124M CPU.
The GPU benchmark is done on an AWS g4dn.4xlarge, which is equipped with an NVIDIA T4 GPU.
All benchmark code, raw data, and tuning logs can be found in this repo [2].&lt;/p&gt;
&lt;h2 id=&quot;performance-of-the-generated-code&quot;&gt;Performance of the generated code&lt;/h2&gt;
&lt;p&gt;We benchmark the fp32 single-batch inference latency on three networks.
Figure 2 shows the relative speedup of auto-scheduler against AutoTVM.
We can see auto-scheduler outperforms AutoTVM in all cases with 1.02x to 8.95x speedup.
This is because auto-scheduler explores a larger search space, which covers more efficient combinations
of optimizations that are missed in TOPI manual templates.
BERT-base@GPU is an extreme case where the manual templates perform especially poorly:
the manual template for dense layers does not work well for the shapes in the BERT model.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/code_perf.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Figure 2. Code Performance Comparison (Higher is better) &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h2 id=&quot;search-time&quot;&gt;Search Time&lt;/h2&gt;
&lt;p&gt;Search-based approaches can be very time-consuming, so we also care about the search time.
It typically takes several hours to let the search converge for a single neural network.
Figure 3 compares the search time of AutoTVM and auto-scheduler.
Auto-scheduler requires much less time to converge in most cases, despite its larger search space.
This is mainly because auto-scheduler has a better cost model and task scheduler.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/intro-auto-scheduler/search_time.png&quot; alt=&quot;image&quot; width=&quot;85%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Figure 3. Search Time Comparison (Lower is better) &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h2 id=&quot;more-results&quot;&gt;More Results&lt;/h2&gt;
&lt;p&gt;The repo above serves as an internal benchmark tool for TVM, so it only compares the latest AutoTVM and auto-scheduler.
You can find results for more libraries and backends in our paper [3].
Recently, this blog post [4] also tried auto-scheduler on an Apple M1 chip and got some good results.&lt;/p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;We built TVM auto-scheduler, a system that automatically generates high-performance code for tensor expressions.
Compared with its predecessor, AutoTVM, auto-scheduler does not require manual templates.
In addition, auto-scheduler is capable of generating schedules with better performance in a shorter time.
We achieve this by making innovations in the search space construction and search algorithm.&lt;/p&gt;
&lt;p&gt;We are excited about the current performance of auto-scheduler.
In the future, we are interested in extending auto-scheduler to better support
sparse operators, low-precision operators, and dynamic shapes.&lt;/p&gt;
&lt;h1 id=&quot;links&quot;&gt;Links&lt;/h1&gt;
&lt;p&gt;[1] Tutorials: &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&quot;&gt;https://tvm.apache.org/docs/tutorials/index.html#autoscheduler-template-free-auto-scheduling&lt;/a&gt;&lt;br /&gt;
[2] Benchmark repo: &lt;a href=&quot;https://github.com/tlc-pack/TLCBench&quot;&gt;https://github.com/tlc-pack/TLCBench&lt;/a&gt;&lt;br /&gt;
[3] OSDI Paper: &lt;a href=&quot;https://arxiv.org/abs/2006.06762&quot;&gt;Ansor : Generating High-Performance Tensor Programs for Deep Learning&lt;/a&gt;&lt;br /&gt;
[4] Results on Apple M1 chip: &lt;a href=&quot;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&quot;&gt;https://medium.com/octoml/on-the-apple-m1-beating-apples-core-ml-4-with-30-model-performance-improvements-9d94af7d1b2d&lt;/a&gt;.&lt;/p&gt;
</content>
</entry>
<entry>
<title>Bring Your Own Datatypes: Enabling Custom Datatype Exploration in TVM</title>
<link href="https://tvm.apache.org/2020/09/26/bring-your-own-datatypes"/>
<updated>2020-09-26T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2020/09/26/bring-your-own-datatypes</id>
<content type="html">&lt;p&gt;In this post, we describe the Bring Your Own Datatypes framework, which enables the use of custom datatypes within TVM.&lt;/p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;When designing accelerators, an important decision is how one will approximately represent real numbers in hardware.
This problem has had a longstanding, industry-standard solution: the IEEE 754 floating-point standard.&lt;sup id=&quot;fnref:ieee&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:ieee&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;1&lt;/a&gt;&lt;/sup&gt;
Yet,
when trying to squeeze
the most out of hardware
by building highly specialized designs,
does it make sense to use
general-purpose IEEE 754 floats?
If we know the numerical requirements
of our workload,
could we build a smaller,
faster,
or more power-efficient datatype?
The answer is yes!
Researchers have already begun experimenting with new datatypes in academic and industrial accelerator designs.
For example, Google’s Tensor Processing Unit (the TPU) uses the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bfloat&lt;/code&gt; type: a single-precision IEEE float which has been truncated to 16 bits.
Due to the lax numerical requirements
of many deep learning workloads,
this truncation often has no effect
on model accuracy,
while instantly cutting the storage cost
in half.&lt;sup id=&quot;fnref:jouppi2017datacenter&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:jouppi2017datacenter&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;2&lt;/a&gt;&lt;/sup&gt;&lt;sup id=&quot;fnref:tensorflowbfloat&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:tensorflowbfloat&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;3&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
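&lt;p&gt;The truncation described above can be reproduced in a few lines of standard-library Python. This is only an illustration of the bit layout; the function name is hypothetical, and real hardware may round rather than truncate.&lt;/p&gt;

```python
import math
import struct


def truncate_to_bfloat16(value):
    """Keep the top 16 bits of the float32 encoding and zero the rest."""
    packed = struct.pack("!f", value)      # 4 bytes, big-endian float32
    truncated = packed[:2] + b"\x00\x00"   # drop the low 16 mantissa bits
    return struct.unpack("!f", truncated)[0]


print(truncate_to_bfloat16(1.0))      # 1.0 (exactly representable)
print(truncate_to_bfloat16(math.pi))  # 3.140625 (precision is lost)
```

&lt;p&gt;The sign and the full 8-bit exponent survive, so the dynamic range of a single-precision float is kept while only 7 mantissa bits remain.&lt;/p&gt;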
&lt;p&gt;Before researchers begin building hardware for their datatype, however, they first need to determine how their datatype will behave numerically in the workloads they care about.
This often involves first building a software-emulated version of their datatype
(e.g. &lt;a href=&quot;http://www.jhauser.us/arithmetic/SoftFloat.html&quot; target=&quot;_blank&quot;&gt;Berkeley SoftFloat&lt;/a&gt; or &lt;a href=&quot;https://github.com/cjdelisle/libposit&quot; target=&quot;_blank&quot;&gt;libposit&lt;/a&gt;),
and then hacking the datatype directly into workloads,
to see how the workload performs
using the datatype.
Even better
is to integrate the datatype
directly into compilers themselves,
so that many different workloads
can be compiled
to use the datatype.
Both routes can be tedious, with the latter route often becoming unmanageable given the size and complexity of modern compilers.
&lt;a href=&quot;https://github.com/xman/tensorflow&quot; target=&quot;_blank&quot;&gt;One example taken from GitHub&lt;/a&gt; shows someone hacking the &lt;em&gt;posit&lt;/em&gt; datatype into TensorFlow.
The result is 237 commits, adding nearly 6000 lines of code and touching over 200 files across the codebase—and that’s just to add one datatype!
This amount of work is prohibitive for many researchers.&lt;/p&gt;
&lt;p&gt;To address these problems, we present the Bring Your Own Datatypes framework.
The framework enables easy exploration of new datatypes in deep learning workloads by allowing users to plug their simulated datatype into TVM.
Unlike the posits-in-TensorFlow example above, which enables a single new datatype in a compiler, the Bring Your Own Datatypes framework enables a huge variety of user-defined types.&lt;/p&gt;
&lt;h2 id=&quot;bring-your-own-datatypes&quot;&gt;Bring Your Own Datatypes&lt;/h2&gt;
&lt;p&gt;The goal of the Bring Your Own Datatypes framework
is to enable users to run deep learning workloads
using custom datatypes.
In the Bring Your Own Datatypes framework,
“datatype” means a scalar type:
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;
or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint&lt;/code&gt;, for example.
We do not handle more complicated data formats
such as &lt;a href=&quot;https://en.wikipedia.org/wiki/Block_floating_point&quot; target=&quot;_blank&quot;&gt;block floating point&lt;/a&gt;
or Intel’s &lt;a href=&quot;https://arxiv.org/abs/1711.02213&quot; target=&quot;_blank&quot;&gt;Flexpoint&lt;/a&gt;.
Additionally,
we only claim to support
&lt;em&gt;software emulated&lt;/em&gt; versions of these scalar datatypes;
we do not explicitly support compiling and running on custom datatype hardware.&lt;/p&gt;
&lt;p&gt;Each tensor in TVM
is assigned a type code,
which defines the datatype of the scalars
within the tensor.
A number of these type codes
have hard-coded meanings in TVM,
mapping to common datatypes
such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float&lt;/code&gt;.
However,
the vast majority of type codes
are unused.
The Bring Your Own Datatypes framework
allows users to
claim these unused type codes
and add their own new datatypes
at runtime.&lt;/p&gt;
&lt;p&gt;The framework is implemented as
a registry
which sits alongside
TVM’s normal datatype facilities.
There are two primary ways
in which the user interacts with
the datatype registry:
first, &lt;strong&gt;datatype registration,&lt;/strong&gt;
and second, &lt;strong&gt;lowering function registration.&lt;/strong&gt;
These steps are akin to
&lt;em&gt;declaration&lt;/em&gt; and &lt;em&gt;implementation&lt;/em&gt; of the datatype,
respectively.&lt;/p&gt;
&lt;p&gt;Please note that all code referenced in this post is based on the TVM repository’s master branch at commit &lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/4cad71d19fda6d8f7b750c791284c6dfdddf1f07&quot; target=&quot;_blank&quot;&gt;4cad71d&lt;/a&gt;. We will use an example &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; datatype, which can be found under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/target/datatype/posit/posit-wrapper.cc&lt;/code&gt; and can be compiled into TVM with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;USE_BYODT_POSIT&lt;/code&gt; flag.&lt;sup id=&quot;fnref:posit&quot; role=&quot;doc-noteref&quot;&gt;&lt;a href=&quot;#fn:posit&quot; class=&quot;footnote&quot; rel=&quot;footnote&quot;&gt;4&lt;/a&gt;&lt;/sup&gt;&lt;/p&gt;
&lt;h3 id=&quot;datatype-registration&quot;&gt;Datatype Registration&lt;/h3&gt;
&lt;p&gt;To register the datatype,
the user assigns the datatype
a name and a type code,
where the type code comes from
the range of unused type codes
available to custom datatypes.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datatype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'posit'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;150&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The above code registers
the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'posit'&lt;/code&gt; datatype
with type code 150.
This registration step
allows TVM to parse programs
which use the custom type:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tvm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;

&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'x'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'float32'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'y'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'float32'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x_posit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'custom[posit]16'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;y_posit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'custom[posit]16'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;z_posit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x_posit&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y_posit&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;z&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cast&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;z_posit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'float32'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;program&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;y&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;z&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;program&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# v0.0.4
# fn (%x: Tensor[(3), float32], %y: Tensor[(3), float32]) {
# %0 = cast(%x, dtype=&quot;custom[posit]16&quot;);
# %1 = cast(%y, dtype=&quot;custom[posit]16&quot;);
# %2 = add(%0, %1);
# cast(%2, dtype=&quot;float32&quot;)
# }
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The program above
casts &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float32&lt;/code&gt; inputs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;x&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;y&lt;/code&gt;
into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt;s,
adds them,
and casts the result back to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float32&lt;/code&gt;.
Once the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; type is registered,
TVM is able to parse the special &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dtype&lt;/code&gt; syntax
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;custom[&amp;lt;typename&amp;gt;]&lt;/code&gt;,
where &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;typename&amp;gt;&lt;/code&gt; is the name registered for the type.
This syntax also supports the usual
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;&amp;lt;bits&amp;gt;x&amp;lt;lanes&amp;gt;&lt;/code&gt; format;
here, we use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;16&lt;/code&gt; to indicate that
each &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; is 16 bits wide.
(The number of lanes
defaults to 1.)&lt;/p&gt;
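&lt;p&gt;As a rough illustration of the dtype string format just described, the sketch below splits such a string into its type name, bit width, and lane count. It is a standalone regular expression written for this post and not TVM’s own parser; the function name is hypothetical.&lt;/p&gt;

```python
import re

# custom[NAME]BITS with an optional xLANES suffix, e.g. custom[posit]16x4
_CUSTOM_DTYPE = re.compile(r"custom\[(\w+)\](\d+)(?:x(\d+))?")


def parse_custom_dtype(dtype):
    """Split a custom dtype string into (type name, bits, lanes)."""
    match = _CUSTOM_DTYPE.fullmatch(dtype)
    if match is None:
        raise ValueError("not a custom dtype string: " + dtype)
    name, bits, lanes = match.groups()
    # The number of lanes defaults to 1 when the suffix is absent.
    return name, int(bits), int(lanes) if lanes else 1


print(parse_custom_dtype("custom[posit]16"))    # ('posit', 16, 1)
print(parse_custom_dtype("custom[posit]16x4"))  # ('posit', 16, 4)
```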
&lt;h3 id=&quot;lowering-function-registration&quot;&gt;Lowering Function Registration&lt;/h3&gt;
&lt;p&gt;Though TVM can parse the above program,
it cannot yet compile it,
as TVM does not yet understand
how to compile operations
over the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; type.
To compile these programs,
we register &lt;em&gt;lowering functions&lt;/em&gt; for the custom datatype,
which help TVM convert the operations
into something it can understand and compile.&lt;/p&gt;
&lt;p&gt;Generally, the user is not expected to
lower operations
directly to LLVM or CUDA.
Instead, most code using custom datatypes
can be lowered into code which &lt;em&gt;doesn’t&lt;/em&gt; use custom datatypes,
with some simple tricks.
We can then rely on native TVM
to understand and compile the code.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-datatypes/lowering.png&quot; alt=&quot;A lowering function lowering an add over `posit`s to a library call over `uint16_t`s&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt;
Figure 1: The expected result of a user's registered lowering function. A lowering function should convert a program using custom datatypes to a program which native TVM can understand and compile (in this case, a call to an external library, taking two &lt;tt&gt;uint16_t&lt;/tt&gt;s).
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Figure 1 shows a common pattern.
Let’s assume we are
interested in exploring the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; type,
and have chosen to run some workloads
by plugging a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; emulation library (e.g. &lt;a href=&quot;https://github.com/stillwater-sc/universal&quot; target=&quot;_blank&quot;&gt;Stillwater Universal&lt;/a&gt;) into TVM
via the Bring Your Own Datatypes framework.
Our workload is a simple program
which adds two &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; inputs.
Native TVM does not understand
how to implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; addition—but it doesn’t need to,
as we have a library implementing our datatype!
The library contains an implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; addition,
alongside other operators such as multiplication and square root.
To implement this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt; addition,
we’d just like to call into our library.
Thus, our Add node should become a Call node,
calling out to a function (call it &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Posit16es2Add&lt;/code&gt;) in our library.
To store the bits of the input &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt;s
inside a type that TVM understands,
we use 16-bit unsigned integers.
The resulting program
is one that TVM can understand and compile—it
is simply a call to an external library function,
taking two unsigned integers.&lt;/p&gt;
&lt;p&gt;To achieve the above lowering,
we register a lowering function
for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt;:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datatype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;datatype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_lower_func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;({&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;16&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Posit16es2Add'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}),&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'Add'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'llvm'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'posit'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The above code registers
a lowering function
for a specific operator (Add),
compilation target (LLVM),
datatype (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt;), and bit length (16).
The first argument
is the lowering function.
This can be any function
taking a TVM IR node
and returning a new TVM IR node.
In our case,
we use a helper function
provided by the Bring Your Own Datatypes framework.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.target.datatype.create_lower_func({16:'Posit16es2Add'})&lt;/code&gt;
creates a lowering function
for the common pattern described above.
The resulting function
converts the arguments of the given node
to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;uint16_t&lt;/code&gt;,
and then converts the node itself
into a call to the given function name
(in this case, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;'Posit16es2Add'&lt;/code&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;posit&lt;/code&gt;s of bit length 16).
We pass a dictionary to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_lower_func&lt;/code&gt; so that TVM can dispatch
to the appropriate function name based on the bit length of the datatype.&lt;/p&gt;
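&lt;p&gt;The rewrite performed by such a lowering function can be mimicked with a few plain Python classes. Everything here is a hypothetical stand-in: &lt;code&gt;Var&lt;/code&gt;, &lt;code&gt;Add&lt;/code&gt;, and &lt;code&gt;Call&lt;/code&gt; are not TVM IR nodes, and &lt;code&gt;make_lower_func&lt;/code&gt; only imitates the pattern described above rather than reproducing the framework’s helper.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Var:
    name: str
    dtype: str


@dataclass(frozen=True)
class Add:
    lhs: Var
    rhs: Var
    dtype: str
    bits: int


@dataclass(frozen=True)
class Call:
    func_name: str
    args: Tuple[Var, ...]
    dtype: str


def make_lower_func(names_by_bits):
    """Return a function that rewrites a binary node into an external call."""
    def lower(node):
        # The raw bits of the custom type are stored in an unsigned integer.
        storage = "uint" + str(node.bits)
        args = tuple(Var(arg.name, storage) for arg in (node.lhs, node.rhs))
        # Dispatch on bit length to find the library function to call.
        return Call(names_by_bits[node.bits], args, storage)
    return lower


lower_add = make_lower_func({16: "Posit16es2Add"})
node = Add(Var("a", "custom[posit]16"), Var("b", "custom[posit]16"),
           "custom[posit]16", 16)
lowered = lower_add(node)
print(lowered)
```

&lt;p&gt;The dictionary keyed by bit length is what lets one registration serve several widths of the same datatype.&lt;/p&gt;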
&lt;p&gt;To implement a custom datatype,
the user will need to register
a lowering function for every operator
in the workload they would like to run.
For a network like ResNet,
this will be around 10 operators,
including operators such as Add, Div, various Casts, and Max.
In our tests,
registering a datatype
and all lowering functions
takes around 40 lines of Python.
Once all needed operators
are registered,
custom datatype workloads
can be run
as easily as
any other TVM program!&lt;/p&gt;
&lt;h1 id=&quot;wrapping-up&quot;&gt;Wrapping Up&lt;/h1&gt;
&lt;p&gt;The Bring Your Own Datatypes framework
brings user-defined datatypes to TVM.
We hope this will encourage datatype researchers
to use TVM in their research;
similarly,
we hope this will spark interest
in custom datatypes
within the deep learning community.
For more documentation about the Bring Your Own Datatypes framework,
please visit the &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/dev/bring_your_own_datatypes.html#sphx-glr-tutorials-dev-bring-your-own-datatypes-py&quot; target=&quot;_blank&quot;&gt;Bring Your Own Datatypes to TVM&lt;/a&gt; developer tutorial.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;em&gt;Gus Smith is a PhD student at the University of Washington working with Luis Ceze and Zachary Tatlock at the intersection of computer architecture and programming languages. His website is &lt;a href=&quot;https://justg.us&quot; target=&quot;_blank&quot;&gt;justg.us&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;em&gt;&lt;a href=&quot;https://github.com/hypercubestart&quot; target=&quot;_blank&quot;&gt;Andrew Liu&lt;/a&gt; is an undergraduate student at the University of Washington and a member of UW CSE &lt;a href=&quot;https://sampl.cs.washington.edu/&quot; target=&quot;_blank&quot;&gt;SAMPL&lt;/a&gt; and &lt;a href=&quot;https://uwplse.org/&quot; target=&quot;_blank&quot;&gt;PLSE&lt;/a&gt; labs.&lt;/em&gt;&lt;/p&gt;
&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;div class=&quot;footnotes&quot; role=&quot;doc-endnotes&quot;&gt;
&lt;ol&gt;
&lt;li id=&quot;fn:ieee&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://standards.ieee.org/standard/754-2019.html&quot; target=&quot;_blank&quot;&gt;754-2019 - IEEE Standard for Floating-Point Arithmetic&lt;/a&gt; &lt;a href=&quot;#fnref:ieee&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:jouppi2017datacenter&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;Jouppi, Norman P., et al. “In-datacenter performance analysis of a tensor processing unit.” Proceedings of the 44th Annual International Symposium on Computer Architecture. 2017. &lt;a href=&quot;#fnref:jouppi2017datacenter&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:tensorflowbfloat&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://cloud.google.com/tpu/docs/bfloat16&quot; target=&quot;_blank&quot;&gt;Using bfloat16 with TensorFlow models&lt;/a&gt; &lt;a href=&quot;#fnref:tensorflowbfloat&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li id=&quot;fn:posit&quot; role=&quot;doc-endnote&quot;&gt;
&lt;p&gt;&lt;a href=&quot;https://posithub.org/docs/BeatingFloatingPoint.pdf&quot; target=&quot;_blank&quot;&gt;Beating Floating Point at its Own Game: Posit Arithmetic&lt;/a&gt; &lt;a href=&quot;#fnref:posit&quot; class=&quot;reversefootnote&quot; role=&quot;doc-backlink&quot;&gt;&amp;#8617;&lt;/a&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;/ol&gt;
&lt;/div&gt;
</content>
</entry>
<entry>
<title>How to Bring Your Own Codegen to TVM</title>
<link href="https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm"/>
<updated>2020-07-15T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2020/07/15/how-to-bring-your-own-codegen-to-tvm</id>
<content type="html">&lt;p&gt;To free data scientists from worrying about the performance when developing a new model, hardware backend providers (e.g., Intel, NVIDIA, ARM, etc) either provide kernel libraries such as cuBLAS or cuDNN with many commonly used deep learning kernels, or provide frameworks such as DNNL or TensorRT with a graph engine to let users describe their models in a certain way to achieve high performance. In addition, emerging deep learning accelerators also have their own compilers, kernel libraries, or runtime frameworks.&lt;/p&gt;
&lt;p&gt;However, users have to learn a new programming interface when they attempt to work on a new kernel library or a device. As a result, the demand for a unified programming interface becomes more and more important to let all users and hardware backend providers stand on the same page.&lt;/p&gt;
&lt;p&gt;To share the programming interface with widely used deep learning frameworks, many hardware device providers have attempted to integrate their devices backend to TensorFlow. However, since TensorFlow does not provide an official backend interface for new backends, you have to hack the TensorFlow for registration, which involves many source file changes and makes the future maintenance difficult.&lt;/p&gt;
&lt;p&gt;In this post, we demonstrate how you, as a hardware backend provider, can easily leverage the Bring Your Own Codegen (BYOC) framework to integrate the kernel library/compiler/framework of your hardware device to TVM. The most important advantage of leveraging BYOC framework is that &lt;strong&gt;&lt;em&gt;all related source files of your devices are self-contained, so the codegen/runtime of your devices are pluggable to the TVM code base.&lt;/em&gt;&lt;/strong&gt; It means that 1) the TVM code base with your codegen would be upstream compatible, and 2) TVM users can choose to enable the codegen/runtime based on their needs.&lt;/p&gt;
&lt;p&gt;In the rest of this post, we first illustrate a scenario that you may need TVM with BYOC, followed by an overview of the BYOC compilation and runtime flows. Then, we step-by-step illustrate how to integrate a vendor library or an execution engine to TVM with BYOC by using Intel DNNL (a.k.a. MKL-DNN, OneDNN) as a running example.&lt;/p&gt;
&lt;h2 id=&quot;bring-an-asic-accelerator-to-tvm&quot;&gt;Bring an ASIC Accelerator to TVM&lt;/h2&gt;
&lt;p&gt;Let’s first set up a scenario to illustrate why you may want to bring your accelerator to TVM and what features you can expect from the BYOC framework. If you are not sure whether your case is suitable for BYOC, you are welcome to raise a discussion at &lt;a href=&quot;https://discuss.tvm.ai&quot;&gt;discuss.tvm.ai&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Imagine that you have just built an edge device platform with an ARM CPU and a fantastic accelerator that achieves amazing performance on common image classification models. In other words, your accelerator does well on Conv2D, ReLU, GEMM, and other widely used CNN operators.&lt;/p&gt;
&lt;p&gt;Unfortunately, object detection models are getting more and more popular as well, and your customers need to run both image classification and object detection models on your platform. Although your accelerator is capable of executing almost all operators in object detection models, one operator (e.g., non-maximum suppression, NMS) is missing.&lt;/p&gt;
&lt;h3 id=&quot;let-tvm-execute-unsupported-operators&quot;&gt;Let TVM execute unsupported operators&lt;/h3&gt;
&lt;p&gt;Since TVM has multiple codegens for different backends, it is easy for the open source community to implement new operators on CPU or GPU in a short time. Ideally, if you integrate the compilation flow of your accelerator into TVM with BYOC, TVM will perform Relay graph partitioning to offload a part of the graph to your accelerator while keeping the rest on TVM. As a result, you can claim that your platform is capable of running all models without worrying about new operators.&lt;/p&gt;
&lt;h3 id=&quot;customize-graph-level-optimization&quot;&gt;Customize graph-level optimization&lt;/h3&gt;
&lt;p&gt;Your ASIC accelerator must have its own compilation flow. Usually, it falls into one of the following cases:&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generate a graph representation and feed it to a graph engine&lt;/strong&gt;:
You may have your own graph engine that is capable of executing a graph (or a neural network model) on your accelerator. For example, both Intel DNNL and NVIDIA TensorRT use an engine to run a whole graph or a model, so that they are able to 1) reduce memory transactions between operators and 2) optimize graph execution with operator fusion.&lt;/p&gt;
&lt;p&gt;In order to achieve the above two optimizations, you may need to process the graph at compilation time. For example, Conv2D and bias addition are two separate operators in TVM, but they may be one operator (Conv2D with bias addition capability) on your accelerator. In this case, you may want to optimize the graph by replacing the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d - add&lt;/code&gt; graph pattern with a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;your_conv2d_with_bias&lt;/code&gt; node.&lt;/p&gt;
&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the remaining sections of this post except &lt;a href=&quot;#bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Generate assembly code and compile it to an executable binary&lt;/strong&gt;:
If you do not have an end-to-end execution framework for your platform as in the previous case, you may have a compiler that compiles a program written in the assembly code of your ISA. In order to feed the assembly code to your compiler, you will need a codegen to generate and optimize the assembly code from a Relay graph.&lt;/p&gt;
&lt;p&gt;If your compilation flow falls into this case, then we recommend reading all the remaining sections of this post except &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;how-byoc-works&quot;&gt;How BYOC Works&lt;/h2&gt;
&lt;p&gt;We now briefly explain how the BYOC framework works. For more detailed explanations of the underlying framework components and their implementations, please refer to the &lt;a href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;developer document&lt;/a&gt;. In short, given the Relay graph in Figure 1, the BYOC framework performs the following steps:&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/original_graph.png&quot; alt=&quot;The original Relay graph&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt;
Figure 1: The Original Relay Graph.
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h3 id=&quot;1-graph-annotation&quot;&gt;1. Graph Annotation&lt;/h3&gt;
&lt;p&gt;Taking a user-provided Relay graph, our first step is to annotate the nodes in the graph that can potentially be offloaded to your accelerator. You will need to follow &lt;a href=&quot;#bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/a&gt; to implement a whitelist of supported operators, or a graph pattern list of customized composite operators. An example annotation result is shown in Figure 2.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_annotation.png&quot; alt=&quot;The Graph with Annotations&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt;
Figure 2: The Graph with Annotations.
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h3 id=&quot;2-graph-transformation&quot;&gt;2. Graph Transformation&lt;/h3&gt;
&lt;p&gt;The second step is to transform and optimize the graph based on the annotations. Specifically, BYOC performs the following transformations.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;2.1: Merge compiler region&lt;/strong&gt;: As can be seen in Figure 2, we now have many “regions” in the graph that can be offloaded to your accelerator, but some of them can actually be merged to reduce data transfer and kernel launch overhead. Accordingly, step 2.1 uses a greedy algorithm to merge as many of those regions as possible while guaranteeing functional correctness. The result is depicted in Figure 3.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_merging_regions.png&quot; alt=&quot;After Merging Compiler Regions&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt;
Figure 3: After Merging Compiler Regions.
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
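&lt;p&gt;To make the idea of merging concrete, here is a toy sketch. It assumes the graph is a simple linear chain of operators, and the function name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;merge_regions&lt;/code&gt; and the operator list are hypothetical; the actual BYOC pass works on general dataflow graphs and performs additional correctness checks that this sketch omits.&lt;/p&gt;

```python
from itertools import groupby


def merge_regions(ops, supported):
    """Group consecutive operators by whether the accelerator supports them.

    ops: operator names in execution order (a linear chain, for simplicity).
    supported: set of operator names the accelerator can run.
    Returns a list of (target, [operator names]) regions.
    """
    regions = []
    for on_accel, group in groupby(ops, key=lambda op: op in supported):
        target = "accelerator" if on_accel else "tvm"
        regions.append((target, list(group)))
    return regions


# Hypothetical model: the accelerator lacks non-maximum suppression.
chain = ["nn.conv2d", "add", "nn.relu", "vision.non_max_suppression", "nn.dense"]
regions = merge_regions(chain, {"nn.conv2d", "add", "nn.relu", "nn.dense"})
for target, group in regions:
    print(target, group)
```

&lt;p&gt;In this sketch the three leading operators end up in one accelerator region instead of three, which is the effect the merging step aims for.&lt;/p&gt;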
&lt;p&gt;&lt;strong&gt;2.2: Partition Graph&lt;/strong&gt;: For each region from the previous step, we create a Relay function with an attribute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compiler&lt;/code&gt; to indicate that this Relay function should be entirely offloaded to your accelerator, as shown in Figure 4.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/bring-your-own-codegen/after_partitioning.png&quot; alt=&quot;After Graph Partitioning&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt;
Figure 4: After Graph Partitioning.
&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h3 id=&quot;3-code-generation&quot;&gt;3. Code Generation&lt;/h3&gt;
&lt;p&gt;Now we know which parts of the Relay graph should be offloaded. In this step, we sequentially send every Relay function with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compiler=your_accelerator&lt;/code&gt; to your codegen. Your codegen should compile the Relay function into a form that matches your own compilation flow. It can be either C source code or any other text format.&lt;/p&gt;
&lt;p&gt;Finally, all compiled functions will be serialized along with other non-offloaded Relay functions to a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; file by the TVM &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export_library&lt;/code&gt; Python API. In other words, the user will get only one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; file after running this flow.&lt;/p&gt;
&lt;h3 id=&quot;4-runtime&quot;&gt;4. Runtime&lt;/h3&gt;
&lt;p&gt;You may also need to implement a runtime to initialize your graph engine (if applicable) and execute the compiled functions. During inference, the TVM runtime (i.e., graph runtime or VM) will leverage your runtime to invoke the offloaded functions when it encounters the corresponding function calls in Figure 4. Your runtime is responsible for launching the compiled function with the given input tensor arrays and filling in the results to the output tensor arrays.&lt;/p&gt;
&lt;p&gt;In the rest of this post, we use DNNL as an example to demonstrate how to achieve the above workflow using the BYOC framework. Please note that all code and line numbers referenced in this post are based on the TVM repository’s master branch commit &lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8&quot;&gt;8a0249c&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;bring-dnnl-to-tvm-annotation-rules&quot;&gt;Bring DNNL to TVM: Annotation Rules&lt;/h2&gt;
&lt;p&gt;The BYOC framework provides two approaches for you to describe the supported operators and patterns. You can use both of them simultaneously. In this section, we use DNNL as an example to show how to make use of them. The complete implementation is available &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/python/tvm/relay/op/contrib/dnnl.py&quot;&gt;here&lt;/a&gt;. Note that we put the annotation rules for your codegen under &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;python/tvm/relay/op/contrib/your_codegen_name.py&lt;/code&gt;.&lt;/p&gt;
&lt;h3 id=&quot;rules-for-single-operators&quot;&gt;Rules for single operators&lt;/h3&gt;
&lt;p&gt;You can intuitively specify which Relay operators are supported by your accelerator with the BYOC API. For example, we use the following code snippet to build a rule saying that our DNNL codegen supports Conv2D:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.conv2d&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;target.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_dnnl_conv2d_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This registers a new attribute &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target.dnnl&lt;/code&gt; to the Relay &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nn.conv2d&lt;/code&gt; operator. In this way, the BYOC annotation pass can invoke &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target.dnnl()&lt;/code&gt; for every operator in the graph to check whether it is supported by the DNNL codegen.&lt;/p&gt;
&lt;p&gt;On the other hand, it might be tedious to write the above code snippet for every single operator. For the DNNL implementation, we implemented a helper function, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_register_external_op_helper&lt;/code&gt;, to make our life easier:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_op_attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;target.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_func_wrapper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;supported&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_func_wrapper&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.batch_norm&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.conv2d&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.dense&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;nn.relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;add&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;subtract&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;_register_external_op_helper&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;multiply&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the above example, we specify a list of operators that can be supported by DNNL codegen.&lt;/p&gt;
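&lt;p&gt;To see how such per-operator rules can be consulted during annotation, the following toy sketch mimics the registration mechanism in plain Python. It is not TVM’s implementation: the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;_OP_ATTRS&lt;/code&gt; dictionary, this stand-in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;register_op_attr&lt;/code&gt;, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_supported&lt;/code&gt; helper are all assumptions made for illustration.&lt;/p&gt;

```python
# Toy stand-in for an operator attribute registry (NOT TVM's implementation).
_OP_ATTRS = {}


def register_op_attr(op_name, attr_key):
    """Return a decorator that stores a checker under (op_name, attr_key)."""
    def _register(func):
        _OP_ATTRS[(op_name, attr_key)] = func
        return func
    return _register


def _register_external_op_helper(op_name, supported=True):
    """Register a rule that always answers `supported` for op_name."""
    @register_op_attr(op_name, "target.dnnl")
    def _func_wrapper(attrs, args):
        return supported
    return _func_wrapper


for name in ["nn.conv2d", "nn.relu", "add"]:
    _register_external_op_helper(name)


def is_supported(op_name, target, attrs=None, args=()):
    """Look up the rule for this operator and target; no rule means unsupported."""
    checker = _OP_ATTRS.get((op_name, "target." + target))
    return bool(checker and checker(attrs, args))


print(is_supported("nn.conv2d", "dnnl"))   # a rule was registered above
print(is_supported("nn.softmax", "dnnl"))  # no rule registered
```

&lt;p&gt;Because each rule receives the operator attributes and arguments, a rule could also inspect them and reject unsupported configurations instead of always returning the same answer.&lt;/p&gt;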
&lt;h3 id=&quot;rules-for-graph-patterns&quot;&gt;Rules for graph patterns&lt;/h3&gt;
&lt;p&gt;Your accelerator or compiler may have optimized some patterns (e.g., Conv2D + add + ReLU) into a single instruction or API call. In this case, you can specify a mapping from a graph pattern to your instruction/API. In the case of DNNL, its Conv2D API already includes bias addition and allows the next ReLU to be attached, so we can call DNNL as in the following code snippet (the complete implementation can be found &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L151&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;DNNLConv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;has_bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bool&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;has_relu&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;false&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prop_kind&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;forward_inference&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algorithm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_direct&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv_src_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_weights_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_bias_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv_dst_md&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;strides_dims&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_l&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;padding_dims_r&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// Attach ReLU&lt;/span&gt;
  &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_attr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;has_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_ops&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append_eltwise&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;algorithm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;eltwise_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_post_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
  &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_prim_desc&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;convolution_forward&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;primitive_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv_desc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;engine_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
  &lt;span class=&quot;c1&quot;&gt;// ... skip ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In this case, in addition to a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d&lt;/code&gt;, we would like to map the graph pattern &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d+relu&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNLConv2d(false, true)&lt;/code&gt;, and map &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d+add+relu&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;. We can achieve this with the following code snippet:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'add'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;bias&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.relu'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv_out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;@&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;register_pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_bias_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv2d_bias_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;conv2d_relu_pat&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_patterns&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the DNNL example, we implemented two patterns with different names so that we can easily recognize them in the codegen. Note that the patterns are implemented in the Relay pattern language. You can follow &lt;a href=&quot;https://tvm.apache.org/docs/langref/relay_pattern.html&quot;&gt;this tutorial&lt;/a&gt; to learn how to write your own patterns.&lt;/p&gt;
&lt;p&gt;With the pattern table, we can then use a Relay pass to perform the transformation from&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = nn.conv2d(%data, %weight, ...)
%2 = add(%1, %bias)
%3 = nn.relu(%2)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;to&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;%1 = fn(%input1, %input2, %input3,
        Composite=&quot;dnnl.conv2d_bias_relu&quot;,
        PartitionedFromPattern=&quot;nn.conv2d_add_nn.relu_&quot;) {
  %1 = nn.conv2d(%input1, %input2, ...)
  %2 = add(%1, %input3)
  nn.relu(%2)
}
%2 = %1(%data, %weight, %bias)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Thus, the DNNL codegen can get the pattern name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dnnl.conv2d_bias_relu&lt;/code&gt; and map &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%1&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNLConv2d(true, true)&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;As you may have noticed, we also have an attribute called “PartitionedFromPattern” in the composite function. This can be helpful if your pattern contains &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;wildcard&lt;/code&gt; operators. For example, we may have a pattern table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(&quot;conv2d_with_something&quot;, conv2d -&amp;gt; *)&lt;/code&gt;:&lt;/p&gt;
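&lt;p&gt;The name-based dispatch described above can be sketched in a few lines of Python. This is only an illustration: the table &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;COMPOSITE_TO_CONV2D_FLAGS&lt;/code&gt; and the function &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d_flags&lt;/code&gt; are hypothetical names, and the actual DNNL codegen is written in C++.&lt;/p&gt;

```python
# Hypothetical lookup table: composite pattern name -> (has_bias, has_relu),
# i.e. the two arguments that would be passed to DNNLConv2d.
COMPOSITE_TO_CONV2D_FLAGS = {
    "dnnl.conv2d_relu": (False, True),
    "dnnl.conv2d_bias_relu": (True, True),
}


def conv2d_flags(composite_name):
    """Return the (has_bias, has_relu) flags for a composite function name."""
    try:
        return COMPOSITE_TO_CONV2D_FLAGS[composite_name]
    except KeyError:
        raise ValueError("unsupported composite pattern: " + composite_name)


has_bias, has_relu = conv2d_flags("dnnl.conv2d_bias_relu")
print(has_bias, has_relu)
```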
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;make_pattern&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;with_bias&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'nn.conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;weight&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;wildcard&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;conv&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In this case, you will get a composite function with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Composite=conv2d_with_something&lt;/code&gt;, but you have no idea what graph it actually matched. That’s where PartitionedFromPattern comes into play. You can tell whether the matched graph is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d -&amp;gt; add&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d -&amp;gt; relu&lt;/code&gt; by checking whether &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;PartitionedFromPattern&lt;/code&gt; is &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nn.conv2d_add_&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;nn.conv2d_nn.relu_&lt;/code&gt;.&lt;/p&gt;
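&lt;p&gt;As a rough sketch of that check, a codegen could compare the attribute string against the values it knows how to handle. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;second_op_of_match&lt;/code&gt; is a hypothetical name, and the sketch only recognizes the two strings mentioned above.&lt;/p&gt;

```python
def second_op_of_match(partitioned_from_pattern):
    """Tell which operator followed conv2d in a 'conv2d_with_something' match.

    Only the two attribute values discussed in the text are recognized;
    anything else is reported as None so the caller can fall back.
    """
    known = {
        "nn.conv2d_add_": "add",
        "nn.conv2d_nn.relu_": "nn.relu",
    }
    return known.get(partitioned_from_pattern)


print(second_op_of_match("nn.conv2d_add_"))
print(second_op_of_match("nn.conv2d_nn.relu_"))
```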
&lt;h2 id=&quot;bring-dnnl-to-tvm-relay-graph-transformation&quot;&gt;Bring DNNL to TVM: Relay Graph Transformation&lt;/h2&gt;
&lt;p&gt;With the annotation rules from the previous step, we can now apply a list of BYOC Relay passes to transform the Relay graph from Figure 1 to Figure 4:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;create_relay_module_from_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 1
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeComposite&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pattern_table&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AnnotateTarget&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 2
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;MergeCompilerRegions&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 3
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PartitionGraph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# Output: Figure 4
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As can be seen, each Relay pass maps to a step introduced in &lt;a href=&quot;#how-byoc-works&quot;&gt;How BYOC Works&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/h2&gt;
&lt;p&gt;Now let’s implement the DNNL codegen that serializes a Relay graph to a JSON representation, and then implement the DNNL JSON runtime to deserialize and execute the graph. &lt;em&gt;Note that if you intend to implement a codegen that generates C-compatible programs, you may want to proceed directly to the next section.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To enable the DNNL JSON codegen/runtime in TVM for this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since we implemented the DNNL codegen in both forms in this file for illustration purposes, you can focus on the part guarded by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;USE_JSON_RUNTIME&lt;/code&gt; macro when tracing the code.&lt;/p&gt;
&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your codegen&amp;gt;&lt;/code&gt;. Then we implement the entry function of the DNNL compiler (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L490&quot;&gt;L490&lt;/a&gt;). Please read the comments embedded in the code snippet for details:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ObjectRef&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Get the function name as the symbol to match in runtime.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GetExtSymbol&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Serialize the function to a JSON string (introduced later).&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DNNLJSONSerializer&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;serialize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// The constant tensor names that have been bound to the module.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the JSON graph&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;serializer&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// The function to create DNNL JSON runtime (introduced later).&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.DNNLJSONRuntimeCreate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find JSON runtime module to create&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Create a DNNL runtime module that can run the serialized function.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
&lt;h3 id=&quot;dnnl-json-serialization&quot;&gt;DNNL JSON Serialization&lt;/h3&gt;
&lt;p&gt;Next, we implement the DNNL JSON serializer (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L429&quot;&gt;L429&lt;/a&gt;). We derived it from the BYOC JSON codegen (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/codegen_json/codegen_json.h&quot;&gt;src/relay/backend/contrib/codegen_json/codegen_json.h&lt;/a&gt;). The special handling in the DNNL JSON serializer serializes a composite function call to a JSON node that the DNNL JSON runtime can interpret. Assuming we have a composite function which matches the pattern &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dnnl.conv2d_relu&lt;/code&gt;, the BYOC JSON codegen will generate the following JSON node:&lt;/p&gt;
&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;PartitionedFromPattern:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;nn.conv2d_nn.relu_&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The problem is that we still need the Conv2D attributes such as padding and strides at runtime, but the BYOC JSON serializer only attaches the attributes of the composite function, not those of the body operators. The customized DNNL JSON serializer, in contrast, attaches the attributes of the first and only Conv2D in the composite function to generate the following JSON node:&lt;/p&gt;
&lt;div class=&quot;language-json highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;op:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;kernel&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;name:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;dnnl.conv2d_relu&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;inputs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;attrs:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;shape:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;data_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;NCHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;kernel_layout:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s2&quot;&gt;&quot;OIHW&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;strides:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;err&quot;&gt;padding:&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;&lt;span class=&quot;w&quot;&gt; &lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;w&quot;&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As the DNNL JSON serializer shows, you can customize the serializer to generate JSON in any form you like, as long as your JSON runtime can interpret it.&lt;/p&gt;
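&lt;p&gt;The idea of attaching the body operator’s attributes can be illustrated with plain Python dictionaries. This is only a sketch under assumptions: the real serializer is the C++ class in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;codegen.cc&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;attach_conv2d_attrs&lt;/code&gt; together with the dictionary layout below are hypothetical stand-ins for the JSON nodes shown above.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import json

def attach_conv2d_attrs(node, conv2d_attrs):
    # Return a copy of the composite node whose attrs also carry
    # the attributes of the Conv2D found in the composite body.
    merged = dict(node)
    merged['attrs'] = dict(node['attrs'])
    merged['attrs'].update(conv2d_attrs)
    return merged

# Hypothetical composite node, modeled on the first JSON node above.
node = {
    'op': 'kernel',
    'name': 'dnnl.conv2d_relu',
    'inputs': [[0, 0, 0], [1, 0, 0]],
    'attrs': {'shape': [1, 32, 14, 14]},
}
# Attributes that would be read from the Conv2D call in the body.
conv2d_attrs = {
    'data_layout': ['NCHW'],
    'kernel_layout': ['OIHW'],
    'strides': [1, 1],
    'padding': [1, 1, 1, 1],
}
print(json.dumps(attach_conv2d_attrs(node, conv2d_attrs), indent=2))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The helper copies the node instead of mutating it, so the generic node produced by the base serializer stays available for comparison.&lt;/p&gt;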
&lt;h3 id=&quot;dnnl-json-runtime&quot;&gt;DNNL JSON Runtime&lt;/h3&gt;
&lt;p&gt;We then implement a DNNL JSON runtime to interpret and execute the serialized JSON graph. We put it under &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/runtime/contrib/dnnl/dnnl_json_runtime.cc&lt;/code&gt;&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Again, we first register two APIs to create the runtime so that we can use them anywhere. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime.DNNLJSONRuntimeCreate&lt;/code&gt; is used in the previous part after serialization, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;runtime.module.loadbinary_dnnl_json&lt;/code&gt; is used when loading the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; back.&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// Create a DNNL JSON runtime to interpret and execute the given JSON graph.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make_object&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;symbol_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_names&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.DNNLJSONRuntimeCreate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntimeCreate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.module.loadbinary_dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoadFromBinary&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Now we explain the DNNL JSON runtime implementation. The basic class structure is:&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;class&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DNNLJSONRuntime&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;public&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;JSONRuntimeBase&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;char&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;type_key&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;dnnl_json&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Init&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Initialize the DNNL graph engine.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;BuildEngine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Setup constants entries for weights.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CHECK_EQ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;const_idx_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;The number of input constants must match the number of required.&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;SetupConstants&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;consts&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;Run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// 1. Fill in the input buffers.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// 2. Invoke the engine through interpreting the stream.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// 3. Read and fill output buffers.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Init&lt;/code&gt; function is in charge of building the DNNL engine by interpreting the JSON graph string (see &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L93&quot;&gt;L93&lt;/a&gt; for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BuildEngine&lt;/code&gt;), and filling the constant weights into the corresponding data entry buffers (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SetupConstants&lt;/code&gt; is implemented in the JSON runtime base class, so you only need to invoke it in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Init&lt;/code&gt;). Note that this function is called only once, even if we run inference multiple times.&lt;/p&gt;
&lt;p&gt;Next, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Run&lt;/code&gt; function (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl_json_runtime.cc#L64&quot;&gt;L64&lt;/a&gt;) first writes the input tensors, which may come from user inputs or constant weights, to the corresponding DNNL memory buffers we initialized when building the DNNL engine. It then launches the DNNL engine to execute the JSON graph. Finally, it writes the DNNL output memory buffers back to the corresponding output tensors.&lt;/p&gt;
&lt;p&gt;Since the rest of the DNNL JSON runtime implementation is too DNNL-specific to cover in detail in this post, we will stop here. We would like to emphasize that while the DNNL JSON runtime is a good reference to start with, your JSON runtime could be fully customized to fit your requirements.&lt;/p&gt;
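&lt;p&gt;The division of work between &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Init&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Run&lt;/code&gt; can be sketched in Python. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ToyJSONRuntime&lt;/code&gt; class below is hypothetical and does not use DNNL or TVM; it only mirrors the lifecycle described above, in which constants are bound once and the run step is invoked for every inference. The scaling by a single weight is an assumption made to keep the example small.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class ToyJSONRuntime:
    # Hypothetical stand-in for a JSON runtime; no DNNL involved.
    def __init__(self, const_names):
        self.const_names = list(const_names)
        self.consts = {}
        self.init_calls = 0

    def init(self, consts):
        # One-time setup: check and bind the constant weights.
        if len(consts) != len(self.const_names):
            raise ValueError('number of constants does not match')
        self.consts = dict(zip(self.const_names, consts))
        self.init_calls += 1

    def run(self, data):
        # 1. Read the input. 2. Execute. 3. Return the output.
        weight = self.consts['weight']
        return [x * weight for x in data]

rt = ToyJSONRuntime(['weight'])
rt.init([2.0])  # called once
for batch in ([1.0, 2.0], [3.0]):
    print(rt.run(batch))  # called once per inference
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;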
&lt;h2 id=&quot;bring-dnnl-to-tvm-c-source-codegen&quot;&gt;Bring DNNL to TVM: C Source Codegen&lt;/h2&gt;
&lt;p&gt;Now let’s implement the DNNL codegen that generates C source code, which invokes DNNL APIs to execute the Relay graph. &lt;em&gt;Note that if you intend to implement a codegen that generates another graph representation, such as JSON, you may want to read &lt;a href=&quot;#bring-dnnl-to-tvm-json-codegenruntime&quot;&gt;Bring DNNL to TVM: JSON Codegen/Runtime&lt;/a&gt; and skip this section.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;To enable the DNNL C source codegen in TVM for this example, please make sure DNNL is available on your machine, and build TVM with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN C_SRC)&lt;/code&gt; in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;config.cmake&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The DNNL codegen is implemented in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt;. Since this file implements the DNNL codegen in both forms for illustration purposes, you can focus on the part &lt;strong&gt;NOT&lt;/strong&gt; covered by the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;USE_JSON_RUNTIME&lt;/code&gt; macro when tracing the code.&lt;/p&gt;
&lt;p&gt;We first register the codegen with the TVM registration API (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L510&quot;&gt;L510&lt;/a&gt;). This registration makes the TVM compile engine dispatch the Relay function with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compiler=&amp;lt;your codegen&amp;gt;&lt;/code&gt; to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your codegen&amp;gt;&lt;/code&gt;. Then we implement the entry function of the DNNL compiler (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L490&quot;&gt;L490&lt;/a&gt;):&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ObjectRef&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DNNLModuleCodegen&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVM_REGISTER_GLOBAL&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;relay.ext.dnnl&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_body_typed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DNNLCompiler&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note that &lt;strong&gt;&lt;em&gt;each runtime module is only responsible for one Relay function, meaning that you may have several DNNL runtime modules in a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.so&lt;/code&gt; file.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;
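&lt;p&gt;To make this dispatch concrete, here is a minimal Python sketch of a name-keyed registry. It is a toy model rather than TVM code: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;REGISTRY&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;register_global&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compile_external&lt;/code&gt; are hypothetical names, and the partitioned function is represented by a plain dictionary. It only shows how a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Compiler&lt;/code&gt; attribute could be turned into the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relay.ext.&amp;lt;your codegen&amp;gt;&lt;/code&gt; key used for the lookup.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Toy model of the dispatch described above; this is not TVM code.
# REGISTRY, register_global and compile_external are hypothetical names.
REGISTRY = {}


def register_global(name):
    # Decorator that stores a function under a global string key.
    def decorator(func):
        REGISTRY[name] = func
        return func
    return decorator


@register_global('relay.ext.dnnl')
def dnnl_compiler(func):
    # Stand-in for DNNLCompiler: returns a fake module description.
    return {'kind': 'c_source', 'symbol': func['global_symbol']}


def compile_external(func):
    # Build the registry key from the Compiler attribute and dispatch.
    key = 'relay.ext.' + func['Compiler']
    if key not in REGISTRY:
        raise KeyError('no codegen registered as ' + key)
    return REGISTRY[key](func)


# One partitioned function yields one module, as noted above.
partitioned = {'Compiler': 'dnnl', 'global_symbol': 'dnnl_0'}
module = compile_external(partitioned)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;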
&lt;p&gt;Then, we derive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; to implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNLModuleCodegen&lt;/code&gt; at &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L362&quot;&gt;L362&lt;/a&gt;. Since &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CSourceModuleCodegenBase&lt;/code&gt; is in charge of other module-level processes such as serialization, we only need to implement the DNNL code generation in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;CreateCSourceModule&lt;/code&gt; function (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L389&quot;&gt;L389&lt;/a&gt;):&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Module&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CreateCSourceModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ObjectRef&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;amp;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;override&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Include headers&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// ...skip...&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// &quot;ref&quot; should be the partitioned Relay function with kCompiler=dnnl.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;IsInstance&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FunctionNode&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;());&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;GenDNNLFunc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Downcast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ref&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// &quot;code&quot; is the generated C code with DNNL APIs.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;string&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;code_stream_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;();&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// &quot;res&quot; is a tuple of constant weights (symbols, values).&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// All constant tensors will be serialized along with the generated C code&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// when export_library is invoked.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;Array&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;String&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;variables&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Create a CSource module with all above artifacts.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;auto&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Registry&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;runtime.CSourceModuleCreate&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;CHECK&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;!=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;nullptr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;Cannot find csource module to create the external runtime module&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;code&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;c&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;variables&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, we implement &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;GenDNNLFunc&lt;/code&gt; (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L365&quot;&gt;L365&lt;/a&gt;) to generate the compilable C code with DNNL APIs as follows. See the embedded comments for explanations of the function interfaces that the TVM C source runtime module expects.&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;c1&quot;&gt;// The example Relay graph: conv2d -&amp;gt; add -&amp;gt; relu.&lt;/span&gt;
&lt;span class=&quot;cp&quot;&gt;#include &amp;lt;cstdint&amp;gt;
#include &amp;lt;cstdlib&amp;gt;
#include &amp;lt;cstring&amp;gt;
#include &amp;lt;vector&amp;gt;
#include &amp;lt;tvm/runtime/c_runtime_api.h&amp;gt;
#include &amp;lt;tvm/runtime/container.h&amp;gt;
#include &amp;lt;tvm/runtime/packed_func.h&amp;gt;
#include &amp;lt;dlpack/dlpack.h&amp;gt;
#include &amp;lt;dnnl/dnnl_kernel.h&amp;gt;
&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;using&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;namespace&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Execute the conv2d-&amp;gt;add-&amp;gt;relu graph with DNNL.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Allocate intermediate buffers.&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4608&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4608&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;malloc&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4608&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Pre-implemented op-based DNNL functions.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dnnl_conv2d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0_i0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;14&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dnnl_add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_i2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dnnl_relu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;12&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Copy the final output to the corresponding buffer.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;memcpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4608&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;std&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;::&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;free&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;buf_2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// The wrapper function with all arguments in DLTensor type.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;extern&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;C&quot;&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DLTensor&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Cast all DLTensor to primitive type buffers and invoke the above&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// execution function.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dnnl_0_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg2&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;static_cast&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;float&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&amp;gt;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;out0&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&amp;gt;&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;));&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// The TVM macro to generate TVM runtime compatible function &quot;dnnl_0&quot;&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// from our generated &quot;dnnl_0_wrapper_&quot;.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVM_DLL_EXPORT_TYPED_FUNC&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dnnl_0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dnnl_0_wrapper_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Note that the pre-implemented op-based DNNL functions are in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/runtime/contrib/dnnl/dnnl.cc&quot;&gt;src/runtime/contrib/dnnl/dnnl.cc&lt;/a&gt;.&lt;/p&gt;
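&lt;p&gt;The constants in the generated code can be checked with a little arithmetic. The Python sketch below assumes that the arguments of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dnnl_conv2d&lt;/code&gt; after the three buffers are N, C, H, W, O, G, followed by the padding, kernel, and stride sizes for height and width; this order is an assumption made for illustration and is not stated in this post. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d_out_dim&lt;/code&gt; is a hypothetical name.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Assumed argument order for dnnl_conv2d after the three buffers:
# N, C, H, W, O, G, pad_h, pad_w, kernel_h, kernel_w, stride_h, stride_w.
def conv2d_out_dim(size, pad, kernel, stride):
    # Standard convolution output size along one spatial axis.
    return (size + 2 * pad - kernel) // stride + 1


n, c, h, w, o, g, ph, pw, kh, kw, sh, sw = 1, 32, 14, 14, 32, 1, 0, 0, 3, 3, 1, 1
out_h = conv2d_out_dim(h, ph, kh, sh)
out_w = conv2d_out_dim(w, pw, kw, sw)
num_elements = n * o * out_h * out_w
num_bytes = 4 * num_elements  # 4 bytes per float element
print(out_h, out_w, num_elements, num_bytes)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Under that assumption, the 14x14 input becomes a 12x12 output with 32 channels, that is 4608 elements or 18432 bytes, which agrees with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;4 * 4608&lt;/code&gt; buffer size and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;1, 32, 12, 12&lt;/code&gt; shape arguments in the generated code.&lt;/p&gt;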
&lt;p&gt;Since the rest of the implementation in &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc&quot;&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/relay/backend/contrib/dnnl/codegen.cc&lt;/code&gt;&lt;/a&gt; is too DNNL-specific to cover in detail in this post, we will stop here. The main idea is to implement a Relay graph visitor (&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/src/relay/backend/contrib/dnnl/codegen.cc#L138&quot;&gt;L138&lt;/a&gt;) that visits the given Relay function and generates the above C code. As long as your codegen is able to generate TVM runtime compatible C code, you can fully customize the codegen to fit your requirements.&lt;/p&gt;
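&lt;p&gt;The visitor idea can be sketched in a few lines of Python. This is a simplified illustration, not the DNNL codegen: it assumes the graph is a linear chain, uses a hypothetical helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gen_body&lt;/code&gt;, and describes each operator with a hand-written tuple instead of visiting real Relay nodes. It emits the three kernel calls that appear in the body of the generated function.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;# Toy code generator; not the DNNL codegen. Each op is a hypothetical tuple
# (kernel_name, extra_input_or_None, shape_arguments), applied in order.
def gen_body(ops, first_input):
    lines = []
    current = first_input
    for index, (kernel, extra, shape_args) in enumerate(ops):
        # Every op writes to a fresh intermediate buffer.
        out = 'buf_' + str(index)
        inputs = [current] if extra is None else [current, extra]
        args = inputs + [out] + [str(a) for a in shape_args]
        lines.append(kernel + '(' + ', '.join(args) + ');')
        # The output of this op feeds the next one.
        current = out
    return lines


ops = [
    ('dnnl_conv2d', 'dnnl_0_i1', [1, 32, 14, 14, 32, 1, 0, 0, 3, 3, 1, 1]),
    ('dnnl_add', 'dnnl_0_i2', [1, 32, 12, 12]),
    ('dnnl_relu', None, [1, 32, 12, 12]),
]
body = gen_body(ops, 'dnnl_0_i0')
print('\n'.join(body))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;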
&lt;h3 id=&quot;c-source-compilation&quot;&gt;C Source Compilation&lt;/h3&gt;
&lt;p&gt;As you may have noticed, the output of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DNNLCompiler&lt;/code&gt; is a module with the generated C code in text format, which has not yet been compiled by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gcc&lt;/code&gt; into an executable binary. In fact, the generated C code is compiled when users call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;export_library(mod)&lt;/code&gt;, as in the following code snippet:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Include the path of src/runtime/contrib/dnnl/dnnl.cc
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dirname&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;realpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expanduser&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;__file__&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;test_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;..&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;..&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;..&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;contrib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;os&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;source_dir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;src&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;runtime&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;contrib&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Setup the gcc flag to compile DNNL code.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;options&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;-O2&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-std=c++14&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;-I&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;contrib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;util&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tempdir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'lib.so'&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tmp_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relpath&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# The generated C code with DNNL APIs is compiled to a binary lib.so.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;export_library&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fcompile&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kwargs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;c1&quot;&gt;# Load the lib.so back to a runtime module.
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PassContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opt_level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;param&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;update_lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rt_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;json&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h2 id=&quot;bring-dnnl-to-tvm-build-tvm-with-dnnl-codegenruntime&quot;&gt;Bring DNNL to TVM: Build TVM with DNNL Codegen/Runtime&lt;/h2&gt;
&lt;p&gt;Finally, we create &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8a0249cd4d12a2eb1a4e7a692a9265bc63fec5c8/cmake/modules/contrib/DNNL.cmake&quot;&gt;cmake/modules/contrib/DNNL.cmake&lt;/a&gt; to include the DNNL codegen when building TVM. For demonstration purposes, our DNNL codegen has two implementations in the same cmake file. You only need to focus on the one that fits your needs.&lt;/p&gt;
&lt;p&gt;With the cmake file ready, users can now specify &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;set(USE_DNNL_CODEGEN ON)&lt;/code&gt; in their &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;build/config.cmake&lt;/code&gt; to enable the DNNL codegen.&lt;/p&gt;
&lt;hr /&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/zhiics&quot;&gt;Zhi Chen&lt;/a&gt; is a TVM PMC member as well as a senior engineer at SageMaker Neo, Amazon AI, AWS.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;a href=&quot;https://comaniac.github.io&quot;&gt;Cody Yu&lt;/a&gt; is a TVM reviewer as well as an applied scientist at Amazon AI, AWS.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;acknowledgment&quot;&gt;Acknowledgment&lt;/h2&gt;
&lt;p&gt;We would like to thank our colleague Animesh Jain for valuable discussions on the framework design; Tianqi Chen and Jared Roesch from OctoML for system design discussions and prototyping; and Masahiro Masuda from the TVM community for helping with code review and improving the DNNL integration. We would also like to thank Ramana Radhakrishnan, Matthew Barrett, Manupa Karunaratne, and Luke Hutton from ARM, U.K. for contributing several helpful ideas, related Relay passes, and the Arm Compute Library (ACL) integration with BYOC.&lt;/p&gt;
</content>
</entry>
<entry>
<title>Bridging PyTorch and TVM</title>
<link href="https://tvm.apache.org/2020/07/14/bert-pytorch-tvm"/>
<updated>2020-07-14T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2020/07/14/bert-pytorch-tvm</id>
<content type="html">
&lt;p&gt;(A more code-heavy variant is crossposted on the more PyTorch-affine &lt;a href=&quot;https://lernapparat.de/transformers-pytorch-tvm/&quot;&gt;Lernapparat&lt;/a&gt;;
the Jupyter Notebook to follow along is on &lt;a href=&quot;https://github.com/t-vi/pytorch-tvmisc/tree/master/transformers-pytorch-tvm/&quot;&gt;GitHub&lt;/a&gt;.)&lt;/p&gt;
&lt;p&gt;Some of the most intriguing applications of Artificial Intelligence have been in Natural Language Processing.
Models like BERT or GPT-2 and their variants can seemingly grasp enough of a text to continue it in a way that needs a second look to recognize as gibberish.&lt;/p&gt;
&lt;p&gt;These models belong to a class of neural network architectures called &lt;em&gt;Transformers&lt;/em&gt;. One of the favourite libraries
implementing them is the &lt;a href=&quot;https://github.com/huggingface/transformers/&quot;&gt;HuggingFace transformers library&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;But, in contrast to convolutional models or LSTMs where we have heavily optimized implementations, this is not as much the case for transformers.
So here we explore how TVM can fill the gap. We will do so in two steps:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;First we look at BERT inference and tuning that on TVM.&lt;/li&gt;
&lt;li&gt;Secondly, we start some more fundamental exploration of how one could use TVM for training in PyTorch.
Given the experimental nature, we focus on feasibility more than on the performance in this part.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;optimizing-bert-inference-with-tvm&quot;&gt;Optimizing BERT Inference with TVM&lt;/h1&gt;
&lt;p&gt;So how do we get BERT from the transformer library to TVM?&lt;/p&gt;
&lt;p&gt;Helpfully, transformers supports tracing their model with the PyTorch JIT. We use their &lt;a href=&quot;https://huggingface.co/transformers/torchscript.html&quot;&gt;tutorial on it&lt;/a&gt;,
specifically the part until we have a traced model.&lt;/p&gt;
&lt;p&gt;The PyTorch traced model takes around 0.65-0.7 seconds for 100 runs on my AMD Radeon VII with the example inputs, which means 6.5-7ms per run.
Let us see if we can use TVM to get faster. Converting our model to TVM is a breeze:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;shape_list&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;debugName&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;split&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'.'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sizes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;traced_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:]]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;mod_bert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params_bert&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frontend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pytorch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_pytorch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;traced_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;shape_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;default_dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;There will be a few warnings about not finding dtype information, but it goes well!
We can now build and run it. Building follows the standard TVM recipe. We also convert the PyTorch (cpu) tensors to TVM arrays.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'rocm -model=gfx906'&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# use what matches your GPU
&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;target_host&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'llvm'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;context&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tt_a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tokens_tensor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;st_a&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;segments_tensors&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;backend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;compile_engine&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;clear&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# just to be sure, see https://github.com/apache/incubator-tvm/pull/5724
&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;PassContext&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opt_level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod_bert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;target_host&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target_host&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params_bert&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;contrib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This will warn us a few times:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; WARNING:autotvm:Cannot find config for ... batch_matmul.cuda .... A fallback configuration is used, which may bring great performance regression.
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Uh oh, &lt;em&gt;may bring great performance regression&lt;/em&gt;. Let us see.&lt;/p&gt;
&lt;p&gt;But first we run the model and see if the outputs match:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mf&quot;&gt;8.583069e-06&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mf&quot;&gt;8.493662e-07&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Looks good. Remember that we’re computing in float32, so $10^{-6}$ish is a good result.&lt;/p&gt;
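&lt;p&gt;The comparison code itself is not shown here. As a minimal sketch, assuming the two numbers above are the largest absolute differences between the PyTorch and TVM results for each of the two model outputs, such a check could look like the following. The helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;max_abs_diff&lt;/code&gt; and the sample values are hypothetical and work on flat Python lists; with NumPy arrays one would typically write &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;numpy.abs(a - b).max()&lt;/code&gt; instead.&lt;/p&gt;

```python
def max_abs_diff(reference, candidate):
    """Largest element-wise absolute difference between two flat sequences."""
    if len(reference) != len(candidate):
        raise ValueError("outputs must have the same number of elements")
    return max(abs(r - c) for r, c in zip(reference, candidate))


# Hypothetical stand-ins for a flattened PyTorch output and the matching
# TVM output; the second element differs by exactly 2**-20 (about 1e-6).
pytorch_out = [0.125, -0.5, 0.75]
tvm_out = [0.125, -0.5 + 2.0 ** -20, 0.75]

diff = max_abs_diff(pytorch_out, tvm_out)
print(diff)
```

&lt;p&gt;A single worst-case number per output is a coarse but convenient summary: if it stays near the float32 rounding level, no individual element can be far off.&lt;/p&gt;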
&lt;p&gt;After building our model and setting the parameters, we time our model like this:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;range&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sync&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;timeit&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Ouch, it takes 6.65s per 100 runs, or 67ms per run of the model. That’s slow indeed. But the warning said that this was because it could not find (tuned) configurations. Let us then tune the tasks.&lt;/p&gt;
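&lt;p&gt;As an aside on the measurement itself: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;%timeit&lt;/code&gt; is an IPython magic, so it only works in a notebook or IPython shell. A rough equivalent for a plain Python script, using the standard &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;timeit&lt;/code&gt; module, might look like this. The names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;per_run_ms&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;run_batch&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dummy_step&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dummy_sync&lt;/code&gt; are made up for illustration, and the dummy functions only stand in for the model run and the device synchronization.&lt;/p&gt;

```python
import timeit


def per_run_ms(total_seconds, runs):
    """Convert the wall time of a batch of runs into milliseconds per run."""
    return total_seconds / runs * 1000.0


def run_batch(step, sync, runs=100):
    """Call step() `runs` times, then wait for the device to finish."""
    for _ in range(runs):
        step()
    sync()  # without this, queued GPU work would not be counted


def dummy_step():
    # Hypothetical stand-in for module.run().
    sum(range(1000))


def dummy_sync():
    # Hypothetical stand-in for ctx.sync(); nothing to wait for on the CPU.
    pass


# Time three batches of 100 calls each and keep the fastest batch.
batch_times = timeit.repeat(
    lambda: run_batch(dummy_step, dummy_sync), number=1, repeat=3
)
print(f"{per_run_ms(min(batch_times), 100):.4f} ms per run")
```

&lt;p&gt;Taking the minimum over a few repeats reduces the influence of one-off delays such as other processes competing for the machine.&lt;/p&gt;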
&lt;p&gt;Tuning does take half a day or so (I’m basically following the TVM tuning tutorial for ResNet tuning with autotvm.)&lt;/p&gt;
&lt;p&gt;After this, we can again build the model, this time with the new configuration. This time we should see no comments about missing configurations.
Now it’s in the region of 6.5-7ms per run, similar to PyTorch. This is what we get from this very elementary optimization of our operators. We can push it a little further, though.&lt;/p&gt;
&lt;p&gt;To see how, let us dive deep into BERT modeling and TVM.&lt;/p&gt;
&lt;p&gt;If you don’t want to get the full details, do skip the next section and scroll down to &lt;em&gt;Results&lt;/em&gt;. I should add that I would hope that this tuning part of the tutorial will obsolete itself in the sense that in some near future, you will get much better speed right out of the box or at least after some initial tuning. So if you don’t see a speedup between here and &lt;em&gt;Results&lt;/em&gt;, that’s because I did my homework in submitting patches.&lt;/p&gt;
&lt;h2 id=&quot;the-bert-model&quot;&gt;The BERT model&lt;/h2&gt;
&lt;p&gt;Let us take a closer look at what’s going on in BERT.&lt;/p&gt;
&lt;p&gt;Like many deep learning models, BERT comes with a bit of prologue (vocabulary embeddings) and epilogue (pooling), and the bulk is organized into similar-looking blocks; here we have 12 &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; modules.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;attention_mask&lt;/code&gt; is just to prevent BERT from looking at the answer when dealing with the question.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/bert_model.svg&quot; alt=&quot;Bert Model&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;So let us zoom in and look at a BertLayer in detail, since that ultimately is what we need to make fast.
As we see in the net diagram, the main part of the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; module is a submodule &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertSelfAttention&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/bert_layer.svg&quot; alt=&quot;BertLayer&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Now the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertSelfAttention&lt;/code&gt; captures the famed self-attention mechanism that is the hallmark of transformer models. (I cannot recommend Sasha Rush’s &lt;a href=&quot;http://nlp.seas.harvard.edu/2018/04/03/attention.html&quot;&gt;Annotated Transformer&lt;/a&gt; enough as a detailed walkthrough.)&lt;/p&gt;
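&lt;p&gt;For readers who want the core idea in a few lines, here is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the HuggingFace implementation: it leaves out the learned query/key/value projections, the multiple heads, the attention mask and dropout, and the function names and toy inputs are made up for illustration.&lt;/p&gt;

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(queries, keys, values):
    """Scaled dot-product attention for one head, on lists of vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d_k).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # The output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# Toy example: two tokens with 2-dimensional queries, keys, and values.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out = self_attention(q, k, v)
print(out)
```

&lt;p&gt;Each output row is a weighted mix of the value rows, with the first token leaning towards the first value vector because its query matches the first key best.&lt;/p&gt;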
&lt;h2 id=&quot;putting-the-bertlayer-under-the-microscope&quot;&gt;Putting the BertLayer under the Microscope&lt;/h2&gt;
&lt;p&gt;If we want to go into details, we should run a BertLayer individually.
We grab the inputs of a BertLayer (see the Notebook for how) and convert a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; to TVM as we did for the entire model.&lt;/p&gt;
&lt;p&gt;To look at the TVM module, we define a little visualization helper (loosely based on TVM &lt;a href=&quot;https://github.com/apache/incubator-tvm/pull/4370&quot;&gt;PR#4370&lt;/a&gt;).&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;graphviz&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;visualize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collapse_small&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;collect_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;set&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;visitor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;add&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;e&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;analysis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_order_visit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;visitor&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# node_dict maps a Relay node to an index (node ID)
&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;_traverse_expr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;analysis&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;post_order_visit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;expr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_traverse_expr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;relayviz_nodes&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphviz&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Digraph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'svg'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'node'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'box'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;to_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Constant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;repr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;lstrip&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'Constant('&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[:&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;NotImplementedError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;to_str:&quot;&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;repr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;is_small_const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;collapse_small&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;and&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Constant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;False&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ndarray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;NDArray&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# Sort by node ID
&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;sorted&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;key&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;k&quot;&gt;lambda&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Function&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Function'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type_annotation&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;is&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;None&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type_annotation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'shape'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;x&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;x&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type_annotation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type_annotation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;typstr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Tensor[{}, {}]'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;typstr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type_annotation&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;typstr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'?'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'ellipse'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;update&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'{}: {}'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name_hint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typstr&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;d&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Tuple&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Tuple[...]'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;field&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;field&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Constant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;not&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_small_const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# small consts are shown in ops
&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Constant({}, {})'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt;
&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Call&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;args_with_edge&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;arg_str_list&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;is_small_const&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;arg_str_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;arg_str_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'·'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;args_with_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;append&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;arg_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;', '&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg_str_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;getattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()}&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;hasattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'keys'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;#attrs = inspect.getmembers(node.attrs)
&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr_str_list&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'='&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;&amp;lt;&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;20&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;...&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attrs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;attr_str_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;attr_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'| '&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;', '&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attr_str_list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;attr_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;''&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;collect_ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;if&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'_'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;join&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ops&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'...'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;attr_str&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;''&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;s&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg_str&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;attr_str&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;)'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;s&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;args_with_edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ir&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Op&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;# dot.node(str(node_id), 'Op {}'.format(node.name))
&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;pass&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# covered in call
&lt;/span&gt;        &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TupleGetItem&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'TupleGetItem(idx={})'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;index&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tuple_value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;elif&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;isinstance&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Let&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'Let(XX)'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_attr_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}))&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;value&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
            &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;edge&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;str&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;var&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]))&lt;/span&gt;
        &lt;span class=&quot;k&quot;&gt;else&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
            &lt;span class=&quot;k&quot;&gt;raise&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;RuntimeError&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
                &lt;span class=&quot;s&quot;&gt;'Unknown node type. node_id: {}, node: {}'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;format&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node_id&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;type&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;node&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)))&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dot&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Let’s run that on our main function. For some reason (well, to be fully general, probably) the PyTorch converter will convert &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Linear&lt;/code&gt; layers to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch_matmul&lt;/code&gt; rather than just &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dense&lt;/code&gt;. We’ll get back to this in a bit. As TVM’s &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch_matmul&lt;/code&gt; has the contraction axis last on both operands (unlike PyTorch), there are quite a few transpose operations, too.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;visualize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'main'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/bert-tvm_49_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In addition to our named inputs, we see a number of unnamed (numbered) variables. These are the neural network parameters.&lt;/p&gt;
&lt;p&gt;Let us compile our model.&lt;/p&gt;
&lt;p&gt;Just like the full model, we can run and time our submodule after checking that it computes the same quantities.&lt;/p&gt;
&lt;p&gt;100 runs take 20.2ms. The back of the envelope calculation here is that with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; in PyTorch we are spending about 0.2ms in this layer, so about 2.4ms on 12 layers: not the majority, but a sizeable part of the 6-7ms overall runtime. Let’s compare to TVM. (A good rule is to never optimize without measuring.)&lt;/p&gt;
&lt;p&gt;Similarly, TVM clocks in at 18.2ms for 100 runs. So here we are again roughly on par with PyTorch.&lt;/p&gt;
&lt;p&gt;One thing we see from the picture is that the input is reshaped three times. There is a TVM optimization pass called Common Subexpression Elimination (CSE) that combines the three reshapes.
(A while ago, this did not succeed because the reshapes had distinct shape arguments, but this has since been solved by the TVM developers in the dynamic to static conversion pass.)
Also, the model parameters are reshaped and transposed. Can we get rid of that, too?
Yes. For that we first &lt;em&gt;bind&lt;/em&gt; the parameters, i.e. put them into the model. The parameters then become constants instead of input nodes.
With the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FoldConstant&lt;/code&gt; pass, we can propagate the constants through the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;transpose&lt;/code&gt;s and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reshape&lt;/code&gt;s to move them closer to the matmuls.&lt;/p&gt;
&lt;p&gt;After these three steps (which TVM will do when we compile a relay model), our model looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/bert-tvm_72_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;And now comes an interesting trick. It is more efficient to merge the three batch matmuls with the same input into a single &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch_matmul&lt;/code&gt;. We implemented a pass doing this in &lt;a href=&quot;https://github.com/apache/incubator-tvm/pull/5791&quot;&gt;TVM PR 5791&lt;/a&gt;. So let’s call it and also have another constant-folding pass.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;new_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CombineParallelBatchMatmul&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;new_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;transform&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;FoldConstant&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;visualize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;new_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;main&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/bert-tvm_74_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
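&lt;p&gt;To see why merging the three products does not change the result, here is a minimal NumPy sketch. It is not the TVM pass itself: the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch_matmul_nt&lt;/code&gt;, the tensor names and the sizes are made up for illustration, and it only assumes the convention mentioned earlier that the contraction axis is last on both operands.&lt;/p&gt;

```python
import numpy as np

def batch_matmul_nt(x, y):
    # Assumed convention: contraction over the last axis of both operands,
    # i.e. out[b] = x[b] @ y[b].T
    return np.matmul(x, np.swapaxes(y, -1, -2))

rng = np.random.default_rng(0)
batch, seq, hidden = 1, 14, 32  # illustrative sizes only
x = rng.standard_normal((batch, seq, hidden))
w_q, w_k, w_v = (rng.standard_normal((batch, hidden, hidden)) for _ in range(3))

# Three separate products that share the same input
separate = [batch_matmul_nt(x, w) for w in (w_q, w_k, w_v)]

# One product against the weights stacked along the output-feature axis,
# followed by a split of the result into three pieces
w_merged = np.concatenate([w_q, w_k, w_v], axis=1)
merged = batch_matmul_nt(x, w_merged)
q, k, v = np.split(merged, 3, axis=-1)

print(all(np.allclose(a, b) for a, b in zip(separate, (q, k, v))))
```

&lt;p&gt;The merged version does the same arithmetic in a single, larger matrix product, which is typically easier to run efficiently than three small ones.&lt;/p&gt;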
&lt;p&gt;Awesome. After checking that we still get the same result, we can time again: 25.2 ms for 100 runs. It’s a bit slow again because we need to tune for the new shapes.
After tuning, we are at 12.6ms for 100 runs, so we went from about 0.2ms to about 0.13-0.15ms, a nice speedup.
By our handwavy calculation, this should cut 0.6-0.8ms from the total runtime, or somewhere between 5% and 10%. Let’s check.&lt;/p&gt;
&lt;h2 id=&quot;results-on-the-overall-bert-model-after-optimization&quot;&gt;Results on the overall BERT model after optimization&lt;/h2&gt;
&lt;p&gt;Let’s define a function combining the optimization passes from above and run it on the entire BERT model.
We go through the same exercise as above.&lt;/p&gt;
&lt;p&gt;We get to 624ms for 100 runs. So yay, we went from 6.5-7ms in PyTorch to ~6.2ms in TVM. This is a 5%-10% speedup. Note that we have only looked at one particular, not very large shape. A more serious analysis would consider more problem shapes.&lt;/p&gt;
&lt;p&gt;We could probably take it a bit further yet - e.g. fusing the additions after the batch matmul by handling the reshape, but we’ll leave it at this for now. Also we will benefit from further improvements to TVM, so it will be interesting to see how the benchmark improves over time. In particular, the upcoming Ansor tuning mechanism seems promising.&lt;/p&gt;
&lt;h2 id=&quot;a-peek-under-the-hood&quot;&gt;A peek under the hood&lt;/h2&gt;
&lt;h3 id=&quot;comparing-implementation-of-models&quot;&gt;Comparing implementation of models&lt;/h3&gt;
&lt;p&gt;As you can see, I have always compared PyTorch with TVM outputs to see if they’re good.
Also, when I investigated some inner layer, I grabbed the inputs to that to convert and feed into the TVM model. I do believe that this is a very effective technique.&lt;/p&gt;
&lt;p&gt;Sometimes, however, it is difficult to assess whether a deviation between the results is from numerical accuracy or from an error somewhere.
When I initially converted the model, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;SelfAttention&lt;/code&gt; submodule output was replicated by the TVM model to about 1e-6.
However, the BertLayer conversion had a deviation of something like 1e-3. It was not entirely clear to me whether that might be due to accumulated numerical errors or some material deviation somewhere.
(This turned out to be the GELU activation, which was converted to FastGELU.)&lt;/p&gt;
&lt;p&gt;One of the things I like to do in this case is jump to double precision and check there. Numerical errors should get much smaller, while other deviations would remain of the same order.
With the PyTorch frontend, you can trace the model converted to float64 on the PyTorch side if you pass &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;default_dtype=&quot;float64&quot;&lt;/code&gt; to the conversion function.&lt;/p&gt;
&lt;p&gt;Running the module and comparing to PyTorch should now have 1e-14 or so deviation.&lt;/p&gt;
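&lt;p&gt;The reasoning behind the double-precision check can be illustrated without PyTorch or TVM. The following is only a sketch under stated assumptions: two mathematically identical ways of writing GELU stand in for two correct implementations, and the common tanh approximation of GELU stands in for a materially different function (it is not claimed to be exactly what the converter produced). All function names are made up for this example.&lt;/p&gt;

```python
import math
import numpy as np

def gelu_erf(x):
    # Exact GELU written with the error function
    erf = np.vectorize(math.erf, otypes=[x.dtype])
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

def gelu_erfc(x):
    # The same function rearranged: mathematically identical, rounds differently
    erfc = np.vectorize(math.erfc, otypes=[x.dtype])
    return 0.5 * x * erfc(-x / math.sqrt(2))

def gelu_tanh(x):
    # A genuinely different function: the tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(math.sqrt(2 / math.pi) * (x + 0.044715 * x**3)))

def max_dev(f, g, dtype):
    # Largest absolute deviation between f and g on a fixed grid in the given precision
    x = np.linspace(-4, 4, 1001).astype(dtype)
    return float(np.max(np.abs(f(x) - g(x))))

for dtype in (np.float32, np.float64):
    print(dtype.__name__,
          "rounding only:", max_dev(gelu_erf, gelu_erfc, dtype),
          "different function:", max_dev(gelu_erf, gelu_tanh, dtype))
```

&lt;p&gt;The rounding-only deviation shrinks by many orders of magnitude when moving to float64, while the deviation caused by the different function stays of the same order in both precisions.&lt;/p&gt;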
&lt;h3 id=&quot;improvements-in-tvm-to-facilitate-this-usecase&quot;&gt;Improvements in TVM to facilitate this usecase&lt;/h3&gt;
&lt;p&gt;Before this worked as shown here, we had to close some gaps (but a recent git checkout will include all of them):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The TVM PyTorch converter did not support inputs other than fp32. We &lt;a href=&quot;https://github.com/t-vi/tvm/tree/pytorch_frontend_type_fix&quot;&gt;implemented improved conversion&lt;/a&gt;, now also included in TVM upstream.&lt;/li&gt;
&lt;li&gt;The TVM schedule, i.e. the organization of the computation, of the workhorse operation, batch_matmul, was fixed and it was very slow (similar to running without a tuned schedule now). So we &lt;a href=&quot;https://github.com/apache/incubator-tvm/pull/5752&quot;&gt;implemented a tuneable schedule&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;The PyTorch converter produces batch matmul operations (it could probably also be changed to produce dense layers instead). But as we saw, one of the larger speed advantages is to combine the Query, Key and Value linear layers, so we implemented &lt;a href=&quot;https://github.com/apache/incubator-tvm/pull/5791&quot;&gt;fusing batch matmul operations&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;When comparing the computation results, we noticed that the &lt;a href=&quot;https://pytorch.org/docs/master/generated/torch.nn.GELU.html&quot;&gt;GELU&lt;/a&gt; function was converted to its FastGELU variant. We fixed that. (There is a &lt;em&gt;fast math&lt;/em&gt; optimization pass in TVM that does some replacement of the error function, though we didn’t check if it yields FastGELU for the GELU expressed with the error function.)&lt;/li&gt;
&lt;li&gt;TVM was initially (and still is to some extent) focussed on static shapes. Recently it has been experimenting with dynamic operations. The dynamic reshape, which takes an argument for the target shape, is an early one of these experiments, but as seen above, it prevented the fusion of batch matmuls because the common subexpression elimination pass didn’t detect that it could merge the identical input reshaping. This has improved recently.&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;training-pytorch-models-with-tvm-computation&quot;&gt;Training PyTorch models with TVM computation&lt;/h1&gt;
&lt;p&gt;In this second part we want to see if we could use TVM while training BERT in PyTorch.
Of course, this opens an entirely new can of worms as we need to deal with autodifferentiation.
While we stay with the theme from above and take &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; as the example, our methodology is representative of non-trivial modules in general.
We will want to divert the computation during training to TVM.&lt;/p&gt;
&lt;p&gt;So the user can take a (traceable) module and do&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;add_tvm_dispatch(module, sample_input)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;and then if she calls module with inputs of the same shape as the sample_input, she’ll get the outputs computed by TVM (as PyTorch tensors, of course) and if not, it’ll just use the regular forward.&lt;/p&gt;
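&lt;p&gt;The dispatch-by-shape idea can be sketched in a few lines. This is not the implementation of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_tvm_dispatch&lt;/code&gt;: the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_shape_dispatch&lt;/code&gt; and the two stand-in forward functions are hypothetical, and NumPy arrays take the place of PyTorch tensors.&lt;/p&gt;

```python
import numpy as np

def add_shape_dispatch(fallback, fast_path, sample_inputs):
    """Return a callable that uses fast_path only for inputs shaped like sample_inputs."""
    expected = tuple(inp.shape for inp in sample_inputs)

    def dispatch(*inputs):
        if tuple(inp.shape for inp in inputs) == expected:
            return fast_path(*inputs)  # stands in for the TVM-compiled function
        return fallback(*inputs)       # stands in for the regular forward

    return dispatch

calls = []

def regular_forward(x):
    calls.append("regular")
    return x * 2

def compiled_forward(x):
    calls.append("compiled")
    return x * 2

forward = add_shape_dispatch(regular_forward, compiled_forward, [np.zeros((1, 14, 768))])
forward(np.ones((1, 14, 768)))  # same shape as the sample: compiled path
forward(np.ones((2, 14, 768)))  # different shape: regular path
print(calls)
```

&lt;p&gt;Keying on the exact input shapes matches the static-shape focus discussed above: the compiled function is only valid for the shapes it was built for.&lt;/p&gt;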
&lt;p&gt;We already hinted at the bad news: in this part we will see how to do these things, but we will not yet achieve a great speedup.&lt;/p&gt;
&lt;p&gt;But enough talk, let us dive right in!
Again, we get our relay model by running a traced &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; from the transformer &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Bert&lt;/code&gt; model through &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.relay.frontend.from_pytorch&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;One thing we’ll do in between is to move from a modular interface in PyTorch - with named parameters - to a functional
interface (which is what TVM can do for us). The first thing we want to do for that is arrange for the function arguments to be in an order that we can work with - i.e. first the direct inputs to the module and then the parameters in the same order that PyTorch uses them. After this operation, our &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BertLayer&lt;/code&gt; in TVM looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/pytorch-tvm-training_20_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;As in the BERT inference, we want to run some optimization passes.&lt;/p&gt;
&lt;p&gt;But we also have a few new transformations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One particularity of the Autodifferentiation is that it’ll use a lot of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;..._like&lt;/code&gt; operations to broadcast or “unbroadcast” (summation is the dual of broadcasting w.r.t. autodifferentiation) things. But this means that you now have two tensor arguments, even if the latter doesn’t really need a gradient. &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ZappLike&lt;/code&gt; replaces those operations with the corresponding functions taking a shape parameter instead.&lt;/li&gt;
&lt;li&gt;Another thing is the “rooting” of derivatives. TVM generates a tensor with all ones of the same shape as the return values of our function as the starting point for the chain rule. These are then multiplied with the derivatives of our operations. But multiplication with ones is not doing much, so we strike that. Similarly, TVM initializes the gradient of a variable (an input) to zeros of the same shape. If it isn’t used, the gradient will be zero, but if it is, the “real gradient” will be added to that zero. But adding zero can be eliminated as well. These are taken care of by ZeroZapp and OneZapp.&lt;/li&gt;
&lt;li&gt;TVM doesn’t have a training variant for the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LayerNorm&lt;/code&gt; (or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;BatchNorm&lt;/code&gt; or others). So we implement a pass to spell out the computation.&lt;/li&gt;
&lt;li&gt;TVM also doesn’t have training dropout. Here the problem is somewhat harder to fix, as TVM doesn’t currently have random number generation. We instead replace the dropout by a construct taking a random Bernoulli draw (of 0/1 values) and mimicking dropout with that. The idea is that we’ll use PyTorch to generate this mask for us. This has the added benefit that (if we generate dropout masks in the same order as PyTorch) we’ll get the exact same result.&lt;/li&gt;
&lt;/ul&gt;
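&lt;p&gt;The mask-based dropout replacement from the last item can be sketched in NumPy. This is an illustration under assumptions, not the TVM pass: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dropout_with_mask&lt;/code&gt; is a made-up helper, and the mask is drawn with NumPy here rather than obtained from PyTorch.&lt;/p&gt;

```python
import numpy as np

def dropout_with_mask(x, mask, p):
    # Training-time dropout with an externally supplied 0/1 mask:
    # kept entries are rescaled by 1 / (1 - p)
    return x * mask / (1.0 - p)

rng = np.random.default_rng(12345)
p = 0.1
x = rng.standard_normal((4, 8))

# A Bernoulli(1 - p) keep-mask of 0/1 values
mask = rng.binomial(1, 1 - p, size=x.shape).astype(x.dtype)

y = dropout_with_mask(x, mask, p)

# Applying dropout to a tensor of ones and multiplying by (1 - p)
# recovers the 0/1 mask from a dropout result
recovered = dropout_with_mask(np.ones_like(x), mask, p) * (1.0 - p)
print(np.allclose(recovered, mask))
```

&lt;p&gt;The last step shows how a library dropout function applied to a tensor of ones can serve as a mask generator: undoing the rescaling leaves exactly the 0/1 draw.&lt;/p&gt;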
&lt;p&gt;As hinted at above, TVM’s gradient taking assumes that it is the last element in the computation (the ones-Tensors discussed above). This isn’t a good fit with PyTorch’s modular view which expects a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt; for each output to be given. Happily, this is computationally equivalent to multiplying the output by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt; and summing, so we amend our function with that. We wish to be flexible, so we allow both functions returning a single tensor and those returning a tuple of tensors.&lt;/p&gt;
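&lt;p&gt;The equivalence used here can be checked numerically on a toy layer. In this sketch (the layer, the sizes and all names are made up for illustration), the gradient of the scalar obtained by multiplying the output with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt; and summing is compared with the analytic vector-Jacobian product that a backward function would compute.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 5))
x = rng.standard_normal(5)
grad_out = rng.standard_normal(3)  # what PyTorch would hand to the backward

def layer(x):
    return np.tanh(W @ x)

def scalar_fn(x):
    # The amended function: multiply the output by grad_out and sum
    return np.sum(layer(x) * grad_out)

# Analytic vector-Jacobian product: J^T grad_out with J = diag(1 - y^2) W
y = layer(x)
vjp = W.T @ ((1 - y**2) * grad_out)

# Central finite differences of the scalar function, one coordinate at a time
eps = 1e-6
numeric = np.array([
    (scalar_fn(x + eps * e) - scalar_fn(x - eps * e)) / (2 * eps)
    for e in np.eye(5)
])

print(np.allclose(vjp, numeric, atol=1e-6))
```

&lt;p&gt;Because the amended function is scalar-valued, a gradient routine that starts from a tensor of ones at the output yields the same quantity as a backward that is seeded with an arbitrary &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt;.&lt;/p&gt;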
&lt;p&gt;With these modifications applied, our model looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/pytorch-tvm-training_25_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;Finally we can take the grad. As we get a lot of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;let&lt;/code&gt; nodes, we bring it to normal form using the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ToGraphNormalForm&lt;/code&gt; pass.
TVM’s gradient-taking returns a function that has the same parameters as the original function (in our case amended with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt; and dropout) and then returns a tuple of the original return and a tuple containing gradients for all inputs.
The first thing we do is to drop all the gradients for &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dropout&lt;/code&gt; which we don’t need.
Then we run our simplification passes.&lt;/p&gt;
&lt;p&gt;So this is the graph we have now for forward and backward:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/pytorch-tvm-training_31_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;But in PyTorch, we first compute the forward and then the backwards, so we have to take out the saw and
split our graph. One of the difficult problems is what to do with things computed for both forward and backward. It is a hard problem, related to the MinCut problem.&lt;/p&gt;
&lt;p&gt;Our extremal options could be:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;One could only keep the inputs and recompute everything as needed.&lt;/li&gt;
&lt;li&gt;If we had a scalar output, we could compute the gradient and multiply with the derivative of the later layers on backward. (Loss functions might do that.) This does not, however, work for non-scalar tensor outputs.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We’ll do the following: We compute the forward normally, but we keep all things that will be used in the backward. This is too much, unfortunately, and it is very likely the reason we don’t see an end to end speedup. We’ll discuss some potential heuristics below.&lt;/p&gt;
&lt;p&gt;We use a coloring here. First we color all nodes of the forward computation in red. Then we traverse the gradient calculation and color the nodes that the backward needs from the forward in blue. This gives us a chance to show off the attribute support in our visualization.&lt;/p&gt;
&lt;p&gt;A bit of (PyTorch) terminology: When we have a function &lt;em&gt;Layer : x ↦ y&lt;/em&gt; followed by some &lt;em&gt;Loss : y ↦ l ∈ ℝ&lt;/em&gt;, the backward is &lt;em&gt;BackwardOfLayer : grad_out ↦ grad_in&lt;/em&gt; with &lt;em&gt;grad_out = dl/dy&lt;/em&gt; and &lt;em&gt;grad_in = dl/dx&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/pytorch-tvm-training_34_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;In order to split the function as described above, we collect the blue nodes as the ones to capture - but constants will
just be duplicated and inputs (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Var&lt;/code&gt; nodes) need to be treated separately.
Now we can split out the backward, replacing all the blue nodes with variables.&lt;/p&gt;
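&lt;p&gt;The capture step can be sketched on a toy dependency graph in plain Python. This is only an illustration of the idea, not the Relay-based code: the graph, the node names and the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;reachable&lt;/code&gt; are invented for the example.&lt;/p&gt;

```python
# Each node maps to the nodes it consumes; all names are made up for illustration.
graph = {
    "x": [], "w": [],
    "matmul": ["x", "w"],
    "tanh": ["matmul"],                       # forward output
    "grad_out": [],
    "one_minus_sq": ["tanh"],                 # backward-only nodes start here
    "grad_pre": ["one_minus_sq", "grad_out"],
    "grad_w": ["grad_pre", "x"],              # gradient output
}

def reachable(graph, root):
    # All nodes that root depends on, including root itself
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph[node])
    return seen

forward_nodes = reachable(graph, "tanh")                    # the "red" nodes
backward_only = reachable(graph, "grad_w") - forward_nodes

# The "blue" nodes: forward nodes consumed directly by a backward-only node
needed = {dep for node in backward_only for dep in graph[node] if dep in forward_nodes}

# Inputs are treated separately; the remaining blue nodes are the intermediates
# that the forward has to return in addition to its regular output
inputs = {node for node in forward_nodes if not graph[node]}
captured = sorted(needed - inputs)
reused_inputs = sorted(needed.intersection(inputs))

print("captured intermediates:", captured)
print("inputs reused by the backward:", reused_inputs)
```

&lt;p&gt;In the split backward, each captured node would be replaced by a fresh variable that is fed from the forward result.&lt;/p&gt;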
&lt;p&gt;Next we take the forward and amend it to also return the required intermediates. The forward then looks like this:&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;/images/bert-pytorch/pytorch-tvm-training_40_0.svg&quot; alt=&quot;svg&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;TVM cannot return nested tuples, so we flatten the output in the function. Again we differentiate between tensor-valued functions and tuple-valued ones (i.e. those returning potentially multiple tensors).&lt;/p&gt;
&lt;p&gt;And at last, we can let TVM do its magic and compile our functions, say to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gr_only_compiled_module&lt;/code&gt;
and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fw_and_cap_compiled_module&lt;/code&gt;.
Time to give it a spin. We define convenience functions to move tensors between PyTorch and TVM and get the model parameters as a TVM dictionary.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tensor_to_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;utils&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;def&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;tensor_from_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;return&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;utils&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;a&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;to_dlpack&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;model_params_tvm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor_to_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;state_dict&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Similarly, we get the inputs on the GPU in PyTorch and TVM.&lt;/p&gt;
&lt;p&gt;We need to deal with the dropout. It will turn out that our record of the three dropout random draws happens in the same order as the dropout in the model. We did a depth-first search on the computational graph to find them, and if the values of the dropout are connected in the graph rather than being on independent branches, this will be the order in which PyTorch draws the matrices, too. If not, good luck fiddling with the order.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manual_seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12345&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;drop_c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{}&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;k&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;keys&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# we don't know the order
&lt;/span&gt;    &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typ&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dropout_info&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;drop_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;k&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;functional&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dropout&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ones&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
        &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;getattr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;typ&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;drop_tvm&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor_to_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;n&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;t&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;drop_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;items&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
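&lt;p&gt;A short aside on the arithmetic in the dropout assignment above: in training mode, PyTorch dropout zeroes each element with probability &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;p&lt;/code&gt; and scales the surviving elements by 1/(1-p), so multiplying dropout-of-ones by (1-p) leaves a plain 0/1 mask. The sketch below mimics this with the standard library only; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dropout_of_ones&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;to_mask&lt;/code&gt; are hypothetical stand-ins, not PyTorch or TVM functions.&lt;/p&gt;

```python
import random


def dropout_of_ones(n, p, rng):
    """Hypothetical stand-in for dropout applied to a vector of n ones.

    Each element is zeroed with probability p; survivors are scaled
    by 1 / (1 - p), as inverted dropout does in training mode.
    """
    scale = 1.0 / (1.0 - p)
    return [scale if rng.random() >= p else 0.0 for _ in range(n)]


def to_mask(values, p):
    # Undo the 1 / (1 - p) scaling; rounding absorbs floating-point noise
    # so that kept entries come out as exactly 1.0.
    return [round(v * (1.0 - p), 12) for v in values]


rng = random.Random(12345)
p = 0.1
mask = to_mask(dropout_of_ones(8, p, rng), p)
print(mask)
```

&lt;p&gt;Every entry of the printed list is either 0.0 or 1.0, which is the form in which the random dropout values are handed to the compiled module.&lt;/p&gt;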
&lt;p&gt;Now we can run the forward.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'input'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inp_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'attention_mask'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inp_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;])&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_params_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;drop_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And we can compare the output to PyTorch’s:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manual_seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12345&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inp_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asnumpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;detach&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;This gives &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;2.1457672e-06&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Supergood. Let’s also try the backward. We generate a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;grad_out&lt;/code&gt;, set all the variables, and run the backward model:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;gr_out_c&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;randn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;device&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;cuda&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;dtype&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;num_captures&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capture_vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;num_regular_outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fw_and_cap_fn_flattened&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;body&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fields&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_captures&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;captured_values&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;v&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;name_hint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;fw_and_cap_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;num_regular_outputs&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;v&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;capture_vars&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;drop_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_params_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;**&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;captured_values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;set_input&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'gr:out:0'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tensor_to_tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gr_out_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
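&lt;p&gt;The indexing in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;captured_values&lt;/code&gt; relies on the flattened forward function returning its regular outputs first and the captured intermediates after them. The toy sketch below shows the same bookkeeping with plain Python lists; the helper &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;split_outputs&lt;/code&gt;, the output strings, and the capture names are made up for illustration.&lt;/p&gt;

```python
def split_outputs(all_outputs, capture_names):
    """Split a flat output list into regular outputs and named captures.

    Assumes the layout used above: regular outputs come first, followed by
    one output per captured variable, in the same order as capture_names.
    """
    num_captures = len(capture_names)
    num_regular_outputs = len(all_outputs) - num_captures
    regular = all_outputs[:num_regular_outputs]
    captured = {
        name: all_outputs[num_regular_outputs + i]
        for i, name in enumerate(capture_names)
    }
    return regular, captured


regular, captured = split_outputs(["out0", "cap_a", "cap_b"], ["c0", "c1"])
print(regular)
print(captured)
```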
&lt;p&gt;On the PyTorch side, it is easiest to re-run the forward (remembering to reset the random seed) and get the grads.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;manual_seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;12345&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;train&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;inp_c_rq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inp_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;():&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;p&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;requires_grad_&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inp_c_rq&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)[&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;grads_pt&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;torch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;autograd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grad&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;res&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inp_c_rq&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;+&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;list&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;pytorch_model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;parameters&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gr_out_c&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;allow_unused&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Did it work? It seems so:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;for&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;g_pt&lt;/span&gt; &lt;span class=&quot;ow&quot;&gt;in&lt;/span&gt; &lt;span class=&quot;nb&quot;&gt;enumerate&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;grads_pt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;abs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gr_only_compiled_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asnumpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;-&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;g_pt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;().&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;numpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()).&lt;/span&gt;&lt;span class=&quot;nb&quot;&gt;max&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;gives us a list of numbers in the 1e-5ish range.&lt;/p&gt;
&lt;p&gt;But we wanted to get something running in PyTorch, right?&lt;/p&gt;
&lt;p&gt;Keeping with how PyTorch works, we first define an &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;autograd.Function&lt;/code&gt; that does the things we just did manually:&lt;/p&gt;
&lt;p&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;forward&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Generate the dropout random values,&lt;/li&gt;
&lt;li&gt;Run the forward,&lt;/li&gt;
&lt;li&gt;Record the captures, inputs, and dropout values needed for backward.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;backward&lt;/code&gt;, run the backward and return the result (as PyTorch tensors).&lt;/p&gt;
&lt;p&gt;With that, we get a PyTorch autograd.Function calling into TVM (we would want a small wrapper for that).&lt;/p&gt;
&lt;p&gt;Now all we need to do to achieve our goal of getting a method &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_tvm_dispatch(module, sample_inputs)&lt;/code&gt; is
to trace the module, create the TVM-based autograd function from it, and then replace the forward with one that calls
that function (with the parameters) if applicable or falls back to the usual forward.
Python’s unlimited dynamism makes that kind of hackery relatively easy.
As all this is not really TVM-related, we spare ourselves the details here (but you could check the
&lt;a href=&quot;https://lernapparat.de/transformers-pytorch-tvm/&quot;&gt;companion post&lt;/a&gt;).&lt;/p&gt;
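&lt;p&gt;To give a flavour of the forward replacement without PyTorch or TVM, here is a minimal sketch in plain Python. It is not the code from the companion post: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;ToyModule&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add_dispatch&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fast_fn&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_applicable&lt;/code&gt; are hypothetical names, with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fast_fn&lt;/code&gt; standing in for the TVM-based autograd function.&lt;/p&gt;

```python
import types


class ToyModule:
    """Stand-in for a PyTorch module; it only has a forward method."""

    def forward(self, x):
        return ("original", x)


def add_dispatch(module, fast_fn, is_applicable):
    """Replace module.forward with a dispatcher (hypothetical helper).

    The dispatcher calls fast_fn when is_applicable says the inputs are
    supported and otherwise falls back to the original forward.
    """
    original_forward = module.forward  # bound method, kept for the fallback

    def forward(self, *inputs):
        if is_applicable(*inputs):
            return fast_fn(*inputs)
        return original_forward(*inputs)

    # Bind the new function to this instance only; the class stays untouched.
    module.forward = types.MethodType(forward, module)
    return module


m = add_dispatch(ToyModule(), lambda x: ("fast", x), lambda x: isinstance(x, int))
print(m.forward(3))
print(m.forward("text"))
```

&lt;p&gt;In this toy setting the applicability check is a type test; in the real setting one would presumably compare shapes and dtypes against the sample inputs used for tracing.&lt;/p&gt;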
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p&gt;As I said in the beginning, we aren’t quite where we want to eventually be in terms of performance.
After tuning the tasks (and on the not very realistic inference example from the HuggingFace BERT + PyTorch JIT tutorial)
we run 100 iterations of the TVM-enabled BertLayer forward and backward similar to how we did it for the inference.
One iteration takes 6.2ms going through TVM versus 1.3ms on PyTorch.&lt;/p&gt;
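&lt;p&gt;A measurement of this kind can be set up along the following lines. This is only a generic timing harness built on the standard library, not the script used for the numbers above; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time_per_iteration&lt;/code&gt; and the dummy workload are placeholders, and for GPU work one would additionally need to synchronize the device before reading the clock, which this sketch does not do.&lt;/p&gt;

```python
import time


def time_per_iteration(fn, iterations=100, warmup=10):
    """Return the mean wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up runs are not timed
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return elapsed / iterations * 1000.0


def dummy_workload():
    # Placeholder for one forward and backward pass.
    return sum(i * i for i in range(1000))


ms = time_per_iteration(dummy_workload)
print(f"{ms:.3f} ms per iteration")
```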
&lt;p&gt;So we ran our model through TVM all right. But it’s not as fast as the usual method yet. Here is to opportunity!&lt;/p&gt;
&lt;p&gt;More seriously, we have two immediate paths to improve performance:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Find a better set of captured nodes.&lt;/li&gt;
&lt;li&gt;Find optimizations on the TVM graph.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In terms of heuristics for the former (remember that it is quite likely NP-hard, i.e. I believe it is, but I didn’t work out a formal proof),
one would want to re-do cheap computation, most prominently point-wise computation (or maybe anything but matmul?). But that is for another day.&lt;/p&gt;
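&lt;p&gt;The crudest version of such a heuristic is a rule on operator names: capture the results of matmul-like operators and recompute everything else. The sketch below does only that. The set of expensive operators and the example nodes are assumptions chosen for illustration, and the sketch ignores graph dependencies (recomputing a node requires its inputs to be available), which is where the real difficulty lies.&lt;/p&gt;

```python
# Operators assumed to be expensive enough that their results are worth capturing.
EXPENSIVE_OPS = {"nn.dense", "nn.batch_matmul"}


def choose_captures(nodes):
    """Split (name, op) pairs into captured and recomputed node names."""
    captured, recomputed = [], []
    for name, op in nodes:
        if op in EXPENSIVE_OPS:
            captured.append(name)
        else:
            recomputed.append(name)
    return captured, recomputed


nodes = [
    ("q", "nn.dense"),
    ("scaled", "multiply"),
    ("act", "tanh"),
    ("scores", "nn.batch_matmul"),
]
captured, recomputed = choose_captures(nodes)
print("capture:", captured)
print("recompute:", recomputed)
```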
&lt;p&gt;I hope you enjoyed the tutorial. I look forward to your comments at &lt;a href=&quot;mailto:tv@lernapparat.de&quot;&gt;tv@lernapparat.de&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h1&gt;
&lt;p&gt;I had many interesting discussions with HuggingFace people and Morgan Funtowicz in particular. Also the TVM contributors had many good comments during the review of the patches to TVM and on the forums. The creation of this tutorial was sponsored by AMD.&lt;/p&gt;
&lt;h1 id=&quot;author&quot;&gt;Author&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://lernapparat.de/&quot;&gt;Thomas Viehmann&lt;/a&gt; is the founder of &lt;a href=&quot;https://mathinf.eu/&quot;&gt;MathInf GmbH&lt;/a&gt;, Munich, Germany, a boutique training and consultancy firm focusing on Machine Learning and PyTorch.
He is a PyTorch core developer and co-authored &lt;a href=&quot;https://www.manning.com/books/deep-learning-with-pytorch&quot;&gt;Deep Learning with PyTorch&lt;/a&gt;, which is currently available as a &lt;a href=&quot;https://pytorch.org/deep-learning-with-pytorch&quot;&gt;free download from the PyTorch website&lt;/a&gt;.&lt;/p&gt;
</content>
</entry>
<entry>
<title>TinyML - How TVM is Taming Tiny</title>
<link href="https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny"/>
<updated>2020-06-04T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2020/06/04/tinyml-how-tvm-is-taming-tiny</id>
<content type="html">
&lt;p&gt;&lt;img src=&quot;/images/microtvm/logo.png&quot; alt=&quot;microTVM logo&quot; width=&quot;30%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;The proliferation of low-cost, AI-powered consumer devices has led to widespread interest in “bare-metal” (low-power, often without an operating system) devices among ML researchers and practitioners. While it is already possible for experts to run &lt;em&gt;some&lt;/em&gt; models on &lt;em&gt;some&lt;/em&gt; bare-metal devices, optimizing models for diverse sets of devices is challenging, often requiring manually optimized device-specific libraries. And for those platforms without, say, Linux support, there exists no scalable solution for deploying models. Because of this, in order to target new devices, developers must implement one-off custom software stacks for managing system resources and scheduling model execution.&lt;/p&gt;
&lt;p&gt;The manual optimization of machine learning software is not unique to the domain of bare-metal devices. In fact, this has been a common theme for developers working with other hardware backends (e.g., GPUs and FPGAs). TVM has proven resilient to the onslaught of new hardware targets, but until now, it couldn’t grapple with the unique profile of microcontrollers. To solve the problem in this domain, we’ve extended TVM to feature a microcontroller backend, called µTVM (footnote: pronounced “MicroTVM”). µTVM facilitates host-driven execution of tensor programs on bare-metal devices and enables automatic optimization of these programs via AutoTVM, TVM’s built-in tensor program optimizer. In the figure below, a bird’s eye view of the µTVM + AutoTVM infrastructure is shown:&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/autotvm-infrastructure.png&quot; alt=&quot;/images/microtvm/autotvm-infrastructure.png&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h1 id=&quot;lets-see-it-in-action&quot;&gt;Let’s see it in action&lt;/h1&gt;
&lt;p&gt;Before we talk about what TVM/MicroTVM is or how it works, let’s see a quick example of it in action.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/hardware-connection-diagram.png&quot; alt=&quot;/images/microtvm/hardware-connection-diagram.png&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
A standard µTVM setup, where the host communicates with the device via JTAG.&lt;/p&gt;
&lt;p&gt;Above, we have an &lt;a href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746zg.html&quot;&gt;STM32F746ZG board&lt;/a&gt;, housing an ARM Cortex-M7 processor, an ideal part for AI on the edge given its strong performance in a low power envelope. We use its USB-JTAG port to connect it to our desktop machine. On the desktop, we run OpenOCD to open a JTAG connection with the device; in turn, OpenOCD allows µTVM to control the M7 processor using a device-agnostic TCP socket. With this setup in place, we can run a CIFAR-10 classifier using TVM code that looks like this (full script &lt;a href=&quot;https://github.com/areusch/microtvm-blogpost-eval/blob/master/python/micro_eval/bin/eval.py&quot;&gt;here&lt;/a&gt;):&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6666&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'c -device=micro_dev'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;stm32f746xx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;default_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_ADDR&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;OPENOCD_SERVER_PORT&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_cifar10_cnn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'main'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graph_runtime&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;micro_dev&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;run&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;CIFAR10_CLASSES&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;argmax&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;graph_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;get_output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;).&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;asnumpy&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;())]&lt;/span&gt;
    &lt;span class=&quot;k&quot;&gt;print&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;sa&quot;&gt;f&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'prediction was &lt;/span&gt;&lt;span class=&quot;si&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;prediction&lt;/span&gt;&lt;span class=&quot;si&quot;&gt;}&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Below are the performance results of MicroTVM, compared with &lt;a href=&quot;https://github.com/ARM-software/CMSIS_5/releases/tag/5.6.0&quot;&gt;CMSIS-NN version 5.7.0&lt;/a&gt; (commit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), a hand-optimized library of ML kernels.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/cifar10-int-8-cnn.png&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;As we can see, the out-of-the-box performance isn’t great, but this is where &lt;a href=&quot;https://dl.acm.org/doi/10.5555/3327144.3327258&quot;&gt;AutoTVM&lt;/a&gt; comes to the rescue. We can write a schedule template for our device, do a round of autotuning, then achieve significantly better results. To plug in our autotuned results, we only need to replace this line:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'main'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;with these lines:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;autotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;apply_history_best&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TUNING_RESULTS_FILE&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'main'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TARGET&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;And our results now look like this:&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;We’ve improved our performance by ~2x, and we’re now much closer to CMSIS-NN. Although the MicroTVM CIFAR10 implementation is competitive with a similar TFLite/CMSIS-NN model, this work has just begun to take advantage of TVM’s optimization features. There’s room to optimize further by accelerating other operators such as dense/fully-connected and taking advantage of TVM’s model-specific quantization and operator fusion capabilities. TVM with µTVM enables you to play with the best of them. So how does it work? What’s going on behind the scenes? Let’s dive in now.&lt;/p&gt;
&lt;h1 id=&quot;design&quot;&gt;Design&lt;/h1&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/memory-layout.png&quot; width=&quot;20%&quot; /&gt;&lt;br /&gt;
The µTVM Device Memory Layout in RAM&lt;/p&gt;
&lt;p&gt;µTVM aims to support the lowest common denominator of devices by minimizing the set of requirements that must be satisfied. In particular, users need only provide:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;a C cross-compiler toolchain for their device&lt;/li&gt;
&lt;li&gt;a method for reading/writing to device memory and executing code on the device&lt;/li&gt;
&lt;li&gt;a specification containing the device’s memory layout and general architectural characteristics&lt;/li&gt;
&lt;li&gt;a code snippet that prepares the device for function execution&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Most bare-metal devices have support for C and JTAG (a debugging protocol), so (1) and (2) usually come for free! Furthermore, (3) and (4) are often very small asks. Below are examples of (3) and (4) for STM32F746-series boards.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;device_config&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'device_id'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'arm.stm32f746xx'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# unique identifier for the device
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'toolchain_prefix'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'arm-none-eabi-'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# prefix of each binary in the cross-compilation toolchain (e.g., arm-none-eabi-gcc)
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'base_addr'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mh&quot;&gt;0x20000000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# first address of RAM
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'section_sizes'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# dictionary of desired section sizes in bytes
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'text'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;18000&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'rodata'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'data'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;100&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;},&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;'word_size'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;4&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# device word size
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'thumb_mode'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;bp&quot;&gt;True&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# whether to use ARM's thumb ISA
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'comms_method'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'openocd'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# method of communication with the device
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'server_addr'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'127.0.0.1'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# OpenOCD server address (if 'comms_method' is 'openocd')
&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;'server_port'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;6666&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;c1&quot;&gt;# OpenOCD server port (if 'comms_method' is 'openocd')
&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;syntax&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;unified&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;cpu&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;cortex&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;m7&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;fpu&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;softvfp&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;thumb&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;section&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;text&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;type&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;%&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;function&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;:&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* enable fpu */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xE000ED88&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mh&quot;&gt;0xF00000&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;orr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r2&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;str&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;r1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;r0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;dsb&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;isb&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* set stack pointer */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ldr&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;_utvm_stack_pointer_init&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bl&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UTVMMain&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;size&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;-&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;UTVMInit&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The µTVM infrastructure and device runtime have been built to rely only on these requirements, and we’re working to lessen them by supporting common open source runtime platforms such as Mbed OS to handle the compilation and linking processes.&lt;/p&gt;
&lt;h2 id=&quot;device-sessions&quot;&gt;Device Sessions&lt;/h2&gt;
&lt;p&gt;Given the networked nature of microcontroller interaction, we slightly deviate from standard TVM code by introducing the concept of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;MicroSession&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Every piece of functionality in µTVM relies on having an open session with the target device. If you’re familiar with TVM, you may have noticed a line of code that deviates from the norm in our first code snippet, namely this one:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Session&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;device_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;as&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;sess&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt;
    &lt;span class=&quot;p&quot;&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Every line inside this &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;with&lt;/code&gt; block can call functions in µTVM, with the context being the device specified by &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;device_config&lt;/code&gt;. This line is doing a number of things under the hood, so let’s unpack it.&lt;/p&gt;
&lt;p&gt;First, it initializes a connection with your device, using whichever communication method you specified (usually OpenOCD). The µTVM device runtime is then cross-compiled, using whichever cross-compiler you specified. Finally, space for the compiled binary is allocated by the host, and the binary is loaded onto the device using the opened connection.&lt;/p&gt;
&lt;p&gt;With the runtime now situated on the device, we’ll naturally want some functions to run through it.&lt;/p&gt;
&lt;h2 id=&quot;module-loading&quot;&gt;Module Loading&lt;/h2&gt;
&lt;p&gt;One of the core abstractions in TVM is that of a module. A module stores a set of related functions for a particular device/runtime target. Given that microcontrollers don’t normally have operating systems, µTVM needs to do a lot of extra work to maintain this high-level abstraction. To see what’s going on, we’ll trace through the process of creating and loading a µTVM-compatible module.&lt;/p&gt;
&lt;p&gt;Suppose we have a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;micro.Session&lt;/code&gt; open with our device and a TVM schedule that implements 2D convolution. If we want to load it onto our microcontroller, we need it to emit C code. To do so, we just need to set the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;target&lt;/code&gt; in either &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tvm.build&lt;/code&gt; or &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relay.build&lt;/code&gt;. Example:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'main'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'c -device=micro_dev'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;By setting the target like so, the build process runs through our C code generation backend. However, the resulting C module still resides on the host machine. In order to load it onto the device, we run it through one of the core functions in the µTVM infrastructure: &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;create_micro_mod&lt;/code&gt;. Example:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;create_micro_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;c_module&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;DEV_CONFIG&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The line above cross-compiles the C source within the module, allocates room for the resulting binary (so it can coexist with the runtime in device memory), then sends each section of the binary to its allocated slot on the device. Once the module binary is snug in device memory, function pointers within the binary are patched to give the module access to helper functions in the device runtime (e.g., for allocating scratchpads).&lt;/p&gt;
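&lt;p&gt;As a rough illustration of the allocation step, the sketch below places sections back to back starting at the base address and rounds each start address up to the word size. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;align_up&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;layout_sections&lt;/code&gt; helpers, the section order, and the alignment policy are assumptions made for illustration only; the allocator in µTVM itself may behave differently.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def align_up(addr, word_size):
    # Round addr up to the next multiple of word_size.
    return (addr + word_size - 1) // word_size * word_size

def layout_sections(base_addr, section_sizes, word_size):
    # Hypothetical policy: assign each section a start address in dictionary order.
    layout = {}
    cursor = base_addr
    for name, size in section_sizes.items():
        start = align_up(cursor, word_size)
        layout[name] = (start, size)
        cursor = start + size
    return layout

# Base address, sizes, and word size taken from the device_config example above.
layout = layout_sections(0x20000000, {'text': 18000, 'rodata': 100, 'data': 100}, word_size=4)
for name, (start, size) in layout.items():
    print(name, hex(start), size)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;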
&lt;p&gt;Now, with our kernel loaded on the device, we can grab a remote handle to the convolution function like so:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;micro_func&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;micro_mod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'conv2d'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;h2 id=&quot;tensor-loading&quot;&gt;Tensor Loading&lt;/h2&gt;
&lt;p&gt;If we want to call an operator, we first need some tensors as arguments:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel_np&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;get_conv_inputs&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;micro_dev&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data_np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;tvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;nd&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;array&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;kernel_np&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ctx&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Based on its data type (e.g., &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float32&lt;/code&gt;, etc.) and shape, each tensor’s size in bytes is calculated, and the host allocates a region of memory on the device’s heap. The tensor’s data is then loaded into the allocated region.&lt;/p&gt;
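&lt;p&gt;The size calculation amounts to multiplying the element count by the element width. Here is a minimal sketch, in which the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;DTYPE_BYTES&lt;/code&gt; table and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensor_nbytes&lt;/code&gt; helper are hypothetical names introduced only for illustration:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from math import prod

# Element widths in bytes for a few common data types (illustrative subset).
DTYPE_BYTES = {'int8': 1, 'int16': 2, 'int32': 4, 'float32': 4}

def tensor_nbytes(shape, dtype):
    # Number of elements times bytes per element.
    return prod(shape) * DTYPE_BYTES[dtype]

# A 1x32x32x3 tensor: 3072 elements in total.
print(tensor_nbytes((1, 32, 32, 3), 'int8'))     # 3072
print(tensor_nbytes((1, 32, 32, 3), 'float32'))  # 12288
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;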
&lt;h2 id=&quot;function-calls&quot;&gt;Function Calls&lt;/h2&gt;
&lt;p&gt;Operator execution is perhaps the trickiest part of this system. To simplify its presentation, we’ll first cover strict execution (where operators are executed as soon as they’re called), then lazy execution (where operators are only executed once their results are needed). The latter is how the system actually works.&lt;/p&gt;
&lt;h3 id=&quot;strict-execution&quot;&gt;Strict Execution&lt;/h3&gt;
&lt;p&gt;When calling a function, both input and output tensors are passed as arguments, in what’s known as destination-passing style:&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;conv2D&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;data&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;kernel&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;output&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Given that these tensors are already allocated on the device, we only need to send &lt;em&gt;metadata&lt;/em&gt; to the device (device address, shape, and data type), so it knows which of its resident tensors to use. The runtime representation of a function call includes this metadata, as well as the address of the function being called (shown below). Before constructing this representation, the metadata needs to be serialized into the arguments section on the device that exists expressly for this purpose.&lt;/p&gt;
&lt;div class=&quot;language-c highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;cm&quot;&gt;/*
* task struct for uTVM
*/&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;typedef&lt;/span&gt; &lt;span class=&quot;k&quot;&gt;struct&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* pointer to function to call for this task */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;void&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;);&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* array of argument tensors */&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;TVMValue&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg_values&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* array of datatype codes for each argument */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg_type_codes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;cm&quot;&gt;/* number of arguments */&lt;/span&gt;
&lt;span class=&quot;kt&quot;&gt;int32_t&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;num_args&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;UTVMTask&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In the strict setting, there is a single global &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; instance that we, from the host side, write into. Once we have written to the task, the runtime has everything it needs to execute the function, and we can begin execution at the runtime’s entry point. The runtime will perform some lightweight initialization, run our operator, then return control to the host.&lt;/p&gt;
&lt;h3 id=&quot;lazy-execution&quot;&gt;Lazy Execution&lt;/h3&gt;
&lt;p&gt;In practice, executing operators as soon as the user requests them becomes prohibitively expensive, as communication overhead begins to dominate. We can improve the throughput of our system by delaying evaluation until the user wants the results of the call.&lt;/p&gt;
&lt;p&gt;From an implementation standpoint, instead of eagerly serializing argument metadata and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; data, we now need to accumulate function call metadata on the host side, before flushing it to the device. The device runtime also needs a few changes: (1) we must now have a global array of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;UTVMTask&lt;/code&gt; and (2) we need to loop through and execute each task in order.&lt;/p&gt;
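&lt;p&gt;The host-side bookkeeping can be pictured with a toy model. In the sketch below, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;FakeDevice&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;LazySession&lt;/code&gt; are hypothetical classes, not part of µTVM, and the device is simulated in ordinary Python. It only shows the idea of queueing calls on the host and executing them in order after a single flush.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;class FakeDevice:
    # Stand-in for a microcontroller; counts host-device round trips.
    def __init__(self):
        self.round_trips = 0

    def run_tasks(self, tasks):
        # One round trip: receive the task array, then execute each task in order.
        self.round_trips += 1
        return [func(*args) for func, args in tasks]

class LazySession:
    def __init__(self, device):
        self.device = device
        self.pending = []  # function call metadata accumulated on the host

    def call(self, func, *args):
        # Record the call instead of executing it right away.
        self.pending.append((func, args))

    def flush(self):
        # Send every pending task in a single batch and clear the queue.
        results = self.device.run_tasks(self.pending)
        self.pending = []
        return results

device = FakeDevice()
sess = LazySession(device)
for i in range(10):
    sess.call(pow, i, 2)
results = sess.flush()
print(results)             # squares of 0 through 9
print(device.round_trips)  # 1
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;In this model, strict execution would correspond to flushing after every call, which costs ten round trips instead of one.&lt;/p&gt;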
&lt;h2 id=&quot;autotvm-with-microtvm&quot;&gt;AutoTVM with MicroTVM&lt;/h2&gt;
&lt;p&gt;So far, the runtime we’ve described doesn’t seem very useful for &lt;em&gt;model deployment&lt;/em&gt;, since it relies so heavily on a host machine. This is intentional, and the runtime has in fact been designed for a different goal: &lt;strong&gt;AutoTVM support&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;In general, AutoTVM proposes candidate kernels, runs them on the target backend with random inputs, then uses the timing results to improve its search process. Given that AutoTVM only cares about single operator executions, we have designed the runtime to be operator-oriented, as opposed to being model-oriented. In the case of µTVM though, communication with the device will usually dominate the execution time. Lazy execution allows us to run the same operator many times without returning control to the host, so the communication cost is amortized over each run, and we can get a better idea of the performance profile.&lt;/p&gt;
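&lt;p&gt;To see why batching helps, consider a back-of-the-envelope calculation. The costs below are invented for illustration and are not measurements, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;time_per_run&lt;/code&gt; is a hypothetical helper.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def time_per_run(comm_ms, exec_ms, runs):
    # One round trip of communication is shared by `runs` back-to-back executions.
    return (comm_ms + runs * exec_ms) / runs

# Hypothetical costs: 50 ms per round trip, 2 ms per operator execution.
for runs in (1, 10, 100):
    print(runs, time_per_run(50.0, 2.0, runs))
# 1 52.0
# 10 7.0
# 100 2.5
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;As the number of runs per round trip grows, the measured time per run approaches the execution time of the operator itself.&lt;/p&gt;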
&lt;p&gt;Because AutoTVM requires rapid iteration on large numbers of candidate kernels, the µTVM infrastructure currently uses only RAM. However, for a self-hosted runtime, we will surely need to make use of both flash memory and RAM.&lt;/p&gt;
&lt;h2 id=&quot;the-hosted-graph-runtime&quot;&gt;The Hosted Graph Runtime&lt;/h2&gt;
&lt;p&gt;Although the hosted runtime was designed for AutoTVM, we can still run full models (as long as they don’t have any control flow). This functionality comes for free just by using TVM’s graph runtime, but with a µTVM context. In fact, the only reliance on the host with the graph runtime is for tensor allocation and operator scheduling (which is just a topological sort of the dependence graph).&lt;/p&gt;
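&lt;p&gt;For readers unfamiliar with the scheduling step, here is a small sketch of a topological sort (Kahn’s algorithm) over an operator dependence graph. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;topological_order&lt;/code&gt; function and the operator names are hypothetical, and this is not the graph runtime’s actual code. Every operator must appear as a key, even if it has no inputs.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;from collections import deque

def topological_order(deps):
    # deps maps each operator to the operators whose outputs it consumes.
    remaining = {op: len(inputs) for op, inputs in deps.items()}
    consumers = {op: [] for op in deps}
    for op, inputs in deps.items():
        for src in inputs:
            consumers[src].append(op)
    # Operators with no unmet inputs are ready to run.
    ready = deque(op for op, count in remaining.items() if count == 0)
    order = []
    while ready:
        op = ready.popleft()
        order.append(op)
        for nxt in consumers[op]:
            remaining[nxt] -= 1
            if remaining[nxt] == 0:
                ready.append(nxt)
    if len(order) != len(deps):
        raise ValueError('dependence graph contains a cycle')
    return order

# Hypothetical operator graph with one branch that joins again.
graph = {'a': [], 'b': ['a'], 'c': ['a'], 'd': ['b', 'c']}
print(topological_order(graph))  # ['a', 'b', 'c', 'd']
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;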
&lt;h1 id=&quot;evaluation&quot;&gt;Evaluation&lt;/h1&gt;
&lt;p&gt;With this infrastructure in place, we sought to answer the following questions:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Is µTVM truly device-agnostic?&lt;/li&gt;
&lt;li&gt;How much effort is required to experiment with optimizations using µTVM?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;To evaluate (1), we ran our experiments on two targets:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;An &lt;a href=&quot;https://www.st.com/en/microcontrollers-microprocessors/stm32f746ng.html&quot;&gt;Arm STM32F746NG development board&lt;/a&gt;, featuring a Cortex-M7 processor&lt;/li&gt;
&lt;li&gt;The µTVM host emulated device, which creates a memory arena on the host machine and interfaces with it as if it were a bare-metal device.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;To evaluate (2), we explore optimizations for the Arm board that give the biggest bang for your buck.&lt;/p&gt;
&lt;p&gt;As a point of comparison, we pulled a quantized CIFAR-10 CNN from &lt;a href=&quot;https://developer.arm.com/solutions/machine-learning-on-arm/developer-material/how-to-guides/image-recognition-on-arm-cortex-m-with-cmsis-nn/single-page&quot;&gt;this tutorial by Arm&lt;/a&gt;. In the tutorial, &lt;a href=&quot;https://arm-software.github.io/CMSIS_5/NN/html/index.html&quot;&gt;CMSIS-NN&lt;/a&gt; (a library of highly optimized kernels by Arm experts) is used as the operator library, making this CNN the perfect evaluation target, as we could now directly compare the results of µTVM with CMSIS-NN on the Arm board.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/cifar10-graphical.png&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
Diagram of CIFAR-10 CNN&lt;/p&gt;
&lt;h2 id=&quot;methodology&quot;&gt;Methodology&lt;/h2&gt;
&lt;p&gt;In our experiments, we use TVM from HEAD (commit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;9fa8341&lt;/code&gt;), version 5.7.0 of CMSIS-NN (commit &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;a65b7c9a&lt;/code&gt;), version 1.16.0 of STM32CubeF7, and GCC from Arm’s GNU Tools for Arm Embedded Processors 9-2019-q4-major 9.2.1 toolchain (revision 277599). The host machine used in our experiments runs Ubuntu Linux 18.04.4 LTS and sports an AMD Ryzen Threadripper 2990WX 32-Core Processor with 62GB of RAM. All evaluation scripts for this blogpost are contained in &lt;a href=&quot;https://github.com/areusch/microtvm-blogpost-eval&quot;&gt;this repo&lt;/a&gt;.&lt;/p&gt;
&lt;h3 id=&quot;arm-specific-optimizations&quot;&gt;Arm-Specific Optimizations&lt;/h3&gt;
&lt;p&gt;With CMSIS-NN, the first convolution maps to their &lt;a href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_RGB.c&quot;&gt;RGB convolution implementation&lt;/a&gt; (specifically for usage in input layers) and the latter two map to their &lt;a href=&quot;https://github.com/ARM-software/CMSIS_5/blob/develop/CMSIS/NN/Source/ConvolutionFunctions/arm_convolve_HWC_q7_fast.c&quot;&gt;“fast” convolution implementation&lt;/a&gt;. We felt our performance was close enough for the RGB convolution after the earlier generic optimizations, but were left unsatisfied with our fast convolution results. Luckily, Arm released a &lt;a href=&quot;https://arxiv.org/abs/1801.06601&quot;&gt;paper&lt;/a&gt; describing optimizations used in CMSIS-NN, and we found they are getting massive speedups from SIMD intrinsics. In the paper, they present a matrix multiplication microkernel that uses SIMD intrinsics (figure below). While we could add first-class support for the intrinsics in TVM’s code generation facilities—and this is likely the best move in the long run—TVM offers &lt;a href=&quot;https://tvm.apache.org/docs/tutorials/language/tensorize.html&quot;&gt;tensorization&lt;/a&gt; as a “quick-and-dirty” solution to supporting SIMD.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/simd-diagram.png&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;
Diagram from CMSIS-NN paper showing a 2x2 matrix multiplication microkernel&lt;/p&gt;
&lt;p&gt;Tensorization works by defining a microkernel that can be inserted into the innermost loop of a TVM operator. Using this mechanism, adding SIMD support for the Arm board was as simple as defining a microkernel in C (found &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8d7249688771bb6806595931586d95648036f383/topi/python/topi/arm_cpu/cortex_m7/micro_kernel/gemm.py&quot;&gt;here&lt;/a&gt;) that mirrored the implementation in their paper. We defined a schedule that used this microkernel (found &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/8d7249688771bb6806595931586d95648036f383/topi/python/topi/arm_cpu/cortex_m7/conv2d/direct_simd.py&quot;&gt;here&lt;/a&gt;), autotuned it, then got the “µTVM SIMD tuned” results.&lt;/p&gt;
&lt;p&gt;While we were able to use the SIMD microkernel for direct convolution, CMSIS-NN uses what they call “partial im2col” as their implementation strategy, which offers a tradeoff between performance and memory usage. Instead of manifesting the entire im2col matrix at once, partial im2col generates only a few columns at a time. Then, with each batch, they can send the matrix to their SIMD matmul function.&lt;/p&gt;
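&lt;p&gt;To make the idea concrete, here is a minimal pure-Python sketch of partial im2col. It is not the CMSIS-NN implementation: the function names, the nested-list HWC input, the stride of 1 with no padding, and the plain dot product standing in for the SIMD matmul are all assumptions made for illustration.&lt;/p&gt;

```python
def patch(x, i, j, k):
    # Flatten the k x k x C window of x whose top-left corner is (i, j).
    return [v for di in range(k) for dj in range(k) for v in x[i + di][j + dj]]


def conv2d_partial_im2col(x, w_mat, k, batch):
    # x: H x W x C nested lists; w_mat: one flattened k*k*C filter per row.
    out_h, out_w = len(x) - k + 1, len(x[0]) - k + 1
    positions = [(i, j) for i in range(out_h) for j in range(out_w)]
    out = []
    for start in range(0, len(positions), batch):
        # Materialize only `batch` columns of the im2col matrix at a time.
        cols = [patch(x, i, j, k) for i, j in positions[start:start + batch]]
        # Stand-in for the SIMD matmul: one dot product per (column, filter).
        for col in cols:
            out.append([sum(a * b for a, b in zip(col, f)) for f in w_mat])
    return out  # out_h * out_w rows, each holding one value per filter
```

&lt;p&gt;In this sketch, the hypothetical &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;batch&lt;/code&gt; argument is the memory/performance knob: it changes how many columns are alive at once, but not the result.&lt;/p&gt;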
&lt;p&gt;Our hypothesis was that, among other optimizations, we could find the optimal batch size via autotuning. In practice, we found partial im2col to be significantly slower than our direct convolution implementation, so we don’t include it in the rest of our results.&lt;/p&gt;
&lt;p&gt;There are certainly other optimizations we could pull from CMSIS-NN to close the gap even further:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Batch expansion of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt; weights into &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int16&lt;/code&gt;, to cut down on duplicate expansion for SIMD&lt;/li&gt;
&lt;li&gt;Splitting convolution into 3x3 tiles to reduce padding checks&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;But our goal in this blog post is to show the broad strokes of what can be done with µTVM. Even so, it’s not a competition, because CMSIS-NN (and any other hand-optimized library) can plug directly into TVM using the &lt;a href=&quot;https://tvm.apache.org/docs/dev/relay_bring_your_own_codegen.html&quot;&gt;Bring Your Own Codegen framework&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;end-to-end&quot;&gt;End-To-End&lt;/h2&gt;
&lt;h3 id=&quot;cifar-10&quot;&gt;CIFAR-10&lt;/h3&gt;
&lt;p&gt;After exploring optimizations for convolution, we set out to measure their effects on end-to-end performance. For the Arm board, we collected untuned results, results that were tuned &lt;strong&gt;without&lt;/strong&gt; any use of SIMD, results that were tuned &lt;strong&gt;with&lt;/strong&gt; SIMD, and results using CMSIS-NN. For the emulated host device, we only collected untuned results and generic tuned results.&lt;/p&gt;
&lt;p&gt;&lt;a href=&quot;https://github.com/areusch/microtvm-blogpost-eval&quot;&gt;https://github.com/areusch/microtvm-blogpost-eval&lt;/a&gt;&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn.png&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized CIFAR-10 CNN comparison on an Arm STM32F746NG (re-posted from above)&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot; alt=&quot;/images/microtvm/post-2020-05-28/autotuned-cifar10-int-8-cnn-x86.png&quot; width=&quot;60%&quot; /&gt;&lt;br /&gt;
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt;-quantized CIFAR-10 CNN comparison on µTVM’s emulated host device&lt;/p&gt;
&lt;p&gt;On the Arm STM32-series board, we were able to improve performance by ~2x compared to the initial untuned operators, and we achieved results much closer to CMSIS-NN. Additionally, we were able to significantly improve performance on the host emulated device. Though the x86 &lt;strong&gt;&lt;em&gt;numbers&lt;/em&gt;&lt;/strong&gt; don’t mean much, they show we can use the same infrastructure (µTVM) to optimize performance on vastly different architectures.&lt;/p&gt;
&lt;p&gt;Stay tuned in the future for more end-to-end benchmarks as we scale this approach out more broadly.&lt;/p&gt;
&lt;h1 id=&quot;self-hosted-runtime-the-final-frontier&quot;&gt;Self-Hosted Runtime: The Final Frontier&lt;/h1&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/microtvm/self-hosted-runtime.png&quot; alt=&quot;/images/microtvm/self-hosted-runtime.png&quot; width=&quot;80%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;The envisioned µTVM optimization and deployment pipeline&lt;/p&gt;
&lt;p&gt;While end-to-end benchmark results are already obtainable with the current runtime, as we demonstrated above, deploying these models in a standalone capacity is still on our roadmap. The gap is that the AutoTVM-oriented runtime currently relies on the host to allocate tensors and to schedule function execution. However, to be useful at the edge, we need a pipeline through µTVM that generates a &lt;strong&gt;single&lt;/strong&gt; binary to be run on a bare-metal device. Users will then be able to easily integrate fast ML into their applications by including this binary in their edge application. Each stage of this pipeline is already in place, and now it’s just a matter of gluing it all together, so expect updates from us soon on this front.&lt;/p&gt;
&lt;h1 id=&quot;conclusion&quot;&gt;Conclusion&lt;/h1&gt;
&lt;p&gt;MicroTVM for single-kernel optimization is ready &lt;strong&gt;today&lt;/strong&gt; and is &lt;em&gt;the&lt;/em&gt; choice for that use case. As we now build out self-hosted deployment support we hope you’re just as excited as we are to make µTVM &lt;em&gt;the&lt;/em&gt; choice for model deployment as well. However, this isn’t just a spectator sport - remember: this is all open source! µTVM is still in its early days, so every individual can have a great deal of impact on its trajectory. Check out the &lt;a href=&quot;https://tvm.apache.org/docs/contribute/&quot;&gt;TVM contributor’s guide&lt;/a&gt; if you’re interested in building with us or jump straight into &lt;a href=&quot;https://discuss.tvm.ai/&quot;&gt;the TVM forums&lt;/a&gt; to discuss ideas first.&lt;/p&gt;
&lt;h2 id=&quot;acknowledgements&quot;&gt;Acknowledgements&lt;/h2&gt;
&lt;p&gt;None of this work would have been possible without the following people:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://tqchen.com/&quot;&gt;Tianqi Chen&lt;/a&gt;, for guiding the design and for being a fantastic mentor.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://homes.cs.washington.edu/~patelp1/&quot;&gt;Pratyush Patel&lt;/a&gt;, for collaborating on early prototypes of MicroTVM.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://octoml.ai/&quot;&gt;OctoML&lt;/a&gt;, for facilitating the internships where I have been able to go full steam on this project.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://homes.cs.washington.edu/~moreau/&quot;&gt;Thierry Moreau&lt;/a&gt;, for mentoring me during my time at OctoML.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://homes.cs.washington.edu/~vegaluis/&quot;&gt;Luis Vega&lt;/a&gt;, for teaching me the fundamentals of interacting with microcontrollers.&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://www.linkedin.com/in/themadrasi/?originalSubdomain=uk&quot;&gt;Ramana Radhakrishnan&lt;/a&gt;, for supplying the Arm hardware used in our experiments and for providing guidance on its usage.&lt;/li&gt;
&lt;/ul&gt;
</content>
</entry>
<entry>
<title>Compiling Machine Learning to WASM and WebGPU with Apache TVM</title>
<link href="https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu"/>
<updated>2020-05-14T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2020/05/14/compiling-machine-learning-to-webassembly-and-webgpu</id>
<content type="html">&lt;p&gt;&lt;strong&gt;TLDR&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;We introduced support for WASM and WebGPU to the Apache TVM deep learning compiler. Our experiments show that TVM’s WebGPU backend can get &lt;strong&gt;close to native&lt;/strong&gt; &lt;strong&gt;GPU performance&lt;/strong&gt; when deploying models to the web.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/webgpu/webgpu-mobilenet-perf.png&quot; alt=&quot;image&quot; width=&quot;55%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;Computing is one of the pillars of modern machine learning applications. The introduction of the GPU to accelerate deep learning workloads has increased the rate of progress dramatically. Given the growing requirement to deploy machine learning everywhere, the browser becomes a natural place to deploy intelligent applications.&lt;/p&gt;
&lt;p&gt;While TensorFlow.js and ONNX.js are existing efforts to bring machine learning to the browser, there still exist non-trivial gaps in performance between the web versions and native ones. One of the many reasons is the lack of standard and performant access to the GPU on the web. WebGL lacks important features such as compute shaders and generic storage buffers that are necessary for high performance deep learning.&lt;/p&gt;
&lt;p&gt;WebGPU is the upcoming standard for next generation web graphics which has the possibility to dramatically change this situation. Like the latest generation graphics APIs such as Vulkan and Metal, WebGPU offers first-class compute shader support.&lt;/p&gt;
&lt;p&gt;To explore the potential of using WebGPU for machine learning deployment in the browser, we enhanced the deep learning compiler Apache (incubating) TVM to target WASM (for host code that computes the launching parameters and calls into the device launch) and WebGPU (for device execution). Our preliminary results are quite positive: for the first time, we can deploy machine learning applications on the web while still getting near native performance on the GPU.&lt;/p&gt;
&lt;h2 id=&quot;machine-learning-compiler&quot;&gt;Machine Learning Compiler&lt;/h2&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/webgpu/ml-compiler-flow.png&quot; alt=&quot;image&quot; width=&quot;65%&quot; /&gt;&lt;br /&gt;&lt;/p&gt;
&lt;p&gt;One natural reaction when trying out WebGPU is to write shaders for primitive operators in deep neural networks (matrix multiplication and convolution) and then directly optimize their performance. This is the traditional workflow used by existing frameworks such as TensorFlow.js.&lt;/p&gt;
&lt;p&gt;Instead, we apply a compilation based approach. TVM automatically ingests models from high-level frameworks such as TensorFlow, Keras, PyTorch, MXNet and ONNX and uses a machine learning driven approach to automatically generate low level code, in this case compute shaders in SPIR-V format. The generated code can then be packaged as a deployable module.&lt;/p&gt;
&lt;p&gt;One important advantage of the compilation based approach is the reuse of infrastructure. We are able to effortlessly (relative to &lt;a href=&quot;https://arxiv.org/abs/1901.05350&quot;&gt;other approaches&lt;/a&gt;) target the web by reusing the infrastructure for optimizing GPU kernels for native platforms such as CUDA, Metal and OpenCL. If the mapping of the WebGPU API to native APIs is efficient we can expect similar performance with very little work. More importantly, the &lt;a href=&quot;https://tvm.apache.org/2018/10/03/auto-opt-all&quot;&gt;AutoTVM&lt;/a&gt; infrastructure allows us to specialize the compute shaders for specific models, enabling the generation of the best compute shaders for our specific model of interest.&lt;/p&gt;
&lt;h2 id=&quot;building-a-wasm-and-webgpu-compiler&quot;&gt;Building a WASM and WebGPU Compiler&lt;/h2&gt;
&lt;p&gt;In order to build a compiler that can target WASM and WebGPU, we need the following elements:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;A SPIR-V generator for compute shaders.&lt;/li&gt;
&lt;li&gt;A WASM generator for the host program.&lt;/li&gt;
&lt;li&gt;A runtime to load and execute the generated program.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Luckily, TVM already has a SPIR-V target for Vulkan, and uses LLVM for host code generation. So we can just repurpose the two to generate the device and host programs.&lt;/p&gt;
&lt;p&gt;The main challenge is the runtime. We need a runtime to load the shader code and to enable the host code to communicate with the shader correctly. TVM has a minimal C++ based runtime. We build a minimal web runtime library and link it with the generated shader and host driving code, producing a single WASM file. However, this WASM module still contains two unknown dependencies:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The runtime needs to call into system library calls (malloc, stderr).&lt;/li&gt;
&lt;li&gt;The WASM runtime needs to interact with the WebGPU driver (in JavaScript, where the WebGPU API is a first-class citizen).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;WASI is a standard solution to solve the first problem. While there is not yet a mature WASI on the web, we can use emscripten to generate a WASI-like library (see discussion &lt;a href=&quot;https://github.com/emscripten-core/emscripten/issues/11075&quot;&gt;here&lt;/a&gt;) to provide these system libraries.&lt;/p&gt;
&lt;p&gt;We solve the second problem by building a WebGPU runtime inside TVM’s JS runtime, and calling back to these functions from the WASM module when invoking GPU code. Using the &lt;a href=&quot;https://tvm.apache.org/docs/dev/runtime.html#packedfunc&quot;&gt;PackedFunc&lt;/a&gt; mechanism in TVM’s runtime system, we can directly expose high-level runtime primitives by passing JavaScript closures to the WASM interface. This approach keeps most of the runtime code in JavaScript; we could bring more JS code into the WASM runtime as WASI and WASM support matures.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/webgpu/tvm-wasm-stack.png&quot; alt=&quot;image&quot; width=&quot;65%&quot; /&gt;&lt;/p&gt;
&lt;h2 id=&quot;performance&quot;&gt;Performance&lt;/h2&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/webgpu/webgpu-mobilenet-perf.png&quot; alt=&quot;image&quot; width=&quot;65%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;We ran a quick experiment comparing the execution of a full computational graph via TVM’s WebGPU backend and native targets that use native GPU runtimes (Metal and OpenCL). On the MobileNet model, we find that WebGPU gets close to matching the performance of Metal. Assuming Chrome’s WebGPU runtime targets Metal instead of OpenCL on macOS, we can expect little to no performance loss when targeting the GPU.&lt;/p&gt;
&lt;p&gt;This benchmark excludes the CPU to GPU data copy cost and only benchmarks the GPU execution. Currently the data copy from CPU to GPU can still take 25% of the execution time; however, these costs can further be amortized via approaches like double buffering in a continuous execution setting.&lt;/p&gt;
&lt;p&gt;Our reported end-to-end running time of MobileNet is by no means optimal, since we simply reused tuned programs from a GTX 1080 Ti, which is very different from the Intel graphics GPU. We expect a further performance boost from using &lt;a href=&quot;https://tvm.apache.org/2018/10/03/auto-opt-all&quot;&gt;AutoTVM&lt;/a&gt; on the target platform of interest.&lt;/p&gt;
&lt;h2 id=&quot;looking-to-the-future&quot;&gt;Looking to the Future&lt;/h2&gt;
&lt;p&gt;Our results suggest many interesting opportunities for machine learning on the web. Notably, WebGPU is an API that is still evolving and its implications could go beyond web applications. For example one could target native APIs of WebGPU as it matures and becomes standardized through WASI, enabling standalone WASM applications that make use of WebGPU.&lt;/p&gt;
&lt;p&gt;The TVM community is also actively working on a &lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/master/rust&quot;&gt;Rust based runtime&lt;/a&gt; that would enable much more robust WASM support and enable easier interaction with projects like &lt;a href=&quot;https://github.com/gfx-rs/wgpu-rs&quot;&gt;wgpu&lt;/a&gt;, and the &lt;a href=&quot;https://rustwasm.github.io/docs/book/&quot;&gt;Rust WASM&lt;/a&gt; ecosystem. As an open source project, we are looking for contributors who can bring in new ideas and help push the project in these exciting directions.&lt;/p&gt;
&lt;p&gt;The proposed approach provides effective machine learning support for most of WASM’s application scenarios. The close to native performance could unlock better &lt;a href=&quot;https://en.wikipedia.org/wiki/Federated_learning&quot;&gt;federated learning&lt;/a&gt; capabilities in the browser. The same compiled package should also be able to run on native WASM executors to provide a sandbox for applications.&lt;/p&gt;
&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the Code&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/tqchen/tvm-webgpu-example&quot;&gt;Example project for image classification&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/tree/master/web&quot;&gt;Apache TVM on github&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;acknowledgement&quot;&gt;Acknowledgement&lt;/h2&gt;
&lt;p&gt;We would like to thank the emscripten project for providing the WASM compilation infrastructure as well as the JS library support on the web. We would also like to thank the WebGPU community for various helpful discussions. Thanks to Fletcher Haynes for valuable feedback on the post.&lt;/p&gt;
</content>
</entry>
<entry>
<title>Integrating TVM into PyTorch</title>
<link href="https://tvm.apache.org/2019/05/30/pytorch-frontend"/>
<updated>2019-05-30T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2019/05/30/pytorch-frontend</id>
<content type="html">&lt;p&gt;As TVM continuously demonstrates improvements to the efficiency of deep learning execution,
it has become clear that PyTorch stands to benefit from directly leveraging the compiler stack.
A major tenet of PyTorch is providing seamless and robust integrations that don’t get in the user’s way.
To that end, PyTorch now has an official TVM-based backend, &lt;a href=&quot;https://github.com/pytorch/tvm&quot;&gt;torch_tvm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Usage is simple:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import torch_tvm
torch_tvm.enable()
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;That’s it! PyTorch will then attempt to convert all operators it can to known Relay operators during its JIT compilation process.&lt;/p&gt;
&lt;h3 id=&quot;background&quot;&gt;Background&lt;/h3&gt;
&lt;p&gt;Unlike many other ML frameworks, PyTorch exposes an eager-execution programming interface. This style of programming avoids graph-based meta-programming and focuses on the direct manipulation of n-dimensional arrays (tensors) in a Pythonic way. As such, the framework was initially well suited for the experimentation and development of models, but not for automatic performance optimization or deployment. To leverage optimizing compiler techniques, some large changes were recently introduced to PyTorch to solve this problem.&lt;/p&gt;
&lt;p&gt;&lt;img src=&quot;https://i.imgur.com/4XVHbJE.png&quot; alt=&quot;TVM Integration&quot; /&gt;&lt;/p&gt;
&lt;p&gt;PyTorch 1.0 introduced PyTorch IR, a PyTorch-specific intermediate representation for models similar to Relay. PyTorch programs can be converted into the IR either via model tracing, which records the execution of a model, or via TorchScript, a subset of Python. The new TVM backend lowers PyTorch IR to Relay, and is able to transparently improve PyTorch performance with little user involvement.&lt;/p&gt;
&lt;h3 id=&quot;integration-and-results&quot;&gt;Integration and Results&lt;/h3&gt;
&lt;p&gt;To support Relay, two features were added to the PyTorch JIT: custom transformation passes and custom subgraph interpreters.&lt;/p&gt;
&lt;p&gt;When &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch_tvm&lt;/code&gt; is enabled, subgraphs of PyTorch IR that can be converted to Relay &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Expr&lt;/code&gt;s will be marked as Relay-compatible. Since PyTorch IR does not always contain shape information, none of the subgraphs can be compiled in a useful way before invocation.&lt;/p&gt;
&lt;p&gt;During user invocation, the PyTorch JIT runtime will determine input shape information and compile the previously marked subgraphs with the new Relay C++ &lt;a href=&quot;https://github.com/pytorch/tvm/blob/main/torch_tvm/compiler.cpp#L226-L246&quot;&gt;build system&lt;/a&gt;. The compilation is cached based on input shapes for subsequent runs. More details can be found in the &lt;a href=&quot;https://github.com/pytorch/tvm/blob/main/README.md&quot;&gt;README&lt;/a&gt;.&lt;/p&gt;
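&lt;p&gt;The shape-keyed caching described above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch_tvm&lt;/code&gt; implementation: the class name, the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;compile_fn&lt;/code&gt; callback, and the use of non-empty nested lists in place of tensors are all hypothetical.&lt;/p&gt;

```python
def shape_of(x):
    # Shape of a non-empty rectangular nested list,
    # e.g. [[1, 2, 3], [4, 5, 6]] has shape (2, 3).
    shape = []
    while isinstance(x, list):
        shape.append(len(x))
        x = x[0]
    return tuple(shape)


class ShapeCachedSubgraph:
    def __init__(self, compile_fn):
        # Hypothetical callback: takes a tuple of input shapes, returns a callable.
        self.compile_fn = compile_fn
        self.compiled = {}

    def __call__(self, *inputs):
        key = tuple(shape_of(x) for x in inputs)
        if key not in self.compiled:
            # First call with these shapes: compile and remember the result.
            self.compiled[key] = self.compile_fn(key)
        return self.compiled[key](*inputs)
```

&lt;p&gt;Calling such an object twice with inputs of the same shape triggers a single compilation; a new input shape triggers another one.&lt;/p&gt;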
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch_tvm&lt;/code&gt; has a continuous benchmark system set up, which is monitoring the performance of ResNet18 on CPU.
Out of the box TVM provides over two times the performance of the default PyTorch JIT backend for various ResNet models.
Below is a graph that details the iterations per second achieved with 16 threads on an AWS c5n.4xlarge instance (larger is better):&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;https://i.imgur.com/KfJ7oas.png&quot; alt=&quot;bench&quot; width=&quot;90%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;These results are quite encouraging, and the project will continue to focus on improving CPU inference speed across more models.&lt;/p&gt;
&lt;h3 id=&quot;future-work&quot;&gt;Future work&lt;/h3&gt;
&lt;p&gt;Right now the PyTorch JIT does a lot of work to find pure functional subsets of its IR to feed to Relay. This avoids the need to map aliasing and control flow information to Relay, but is not necessary. Mapping more of the PyTorch IR to Relay may yield performance wins and is a goal of the project. PyTorch IR is rapidly changing as it is being developed, so this must be done carefully.&lt;/p&gt;
&lt;p&gt;More work will be done to ensure the handoff between PyTorch and TVM code is efficient. This includes unifying the threading model and allocators, and reducing the overhead associated with copying inputs into TVM.&lt;/p&gt;
&lt;h3 id=&quot;tutorial&quot;&gt;Tutorial&lt;/h3&gt;
&lt;p&gt;If you already have a PyTorch model, the easiest way to get started is to use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;torch.jit.trace&lt;/code&gt; as follows:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;import time

import torch
import torch_tvm
from tvm import autotvm
from your_model import model, inputs

torch_tvm.enable(opt_level=3)
iters = 100
warmup = 10

# Ensure your model is in eval mode and also turn off gradients.
with torch.no_grad():
    # Use tuned parameters for better performance.
    with autotvm.apply_history_best(&quot;test/autotvm_tuning.log&quot;):
        # This is where all the compilation happens.
        trace_tvm = torch.jit.trace(model, inputs)

        # Warmup
        for _ in range(warmup):
            _ = trace_tvm(*inputs)

        # Benchmark
        start = time.time()
        for _ in range(iters):
            _ = trace_tvm(*inputs)
        tvm_time = time.time() - start

        print(&quot;Took {}s to run {} iters&quot;.format(tvm_time, iters))
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Much of this code comes from &lt;a href=&quot;https://github.com/pytorch/tvm/blob/main/test/benchmarks.py&quot;&gt;benchmarks.py&lt;/a&gt;. Note that tuned parameters for AVX2 LLVM compilation are in the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;test/&lt;/code&gt; folder of the repo.&lt;/p&gt;
&lt;p&gt;If you are more comfortable using Relay directly, it is possible to simply extract the expression directly from a
PyTorch function either via (implicit) tracing or TorchScript:&lt;/p&gt;
&lt;div class=&quot;language-plaintext highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;def add(a, b, c):
    return a + b + c

# via tracing
relay_graph = torch_tvm.to_relay(add, inputs)

@torch.jit.script
def mul(a, b, c):
    return a * b * c

# via script
relay_graph = torch_tvm.to_relay(mul, inputs)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
</content>
</entry>
<entry>
<title>Automating Optimization of Quantized Deep Learning Models on CUDA</title>
<link href="https://tvm.apache.org/2019/04/29/opt-cuda-quantized"/>
<updated>2019-04-29T12:00:00-04:00</updated>
<id>https://tvm.apache.org/2019/04/29/opt-cuda-quantized</id>
<content type="html">&lt;p&gt;Deep learning has been successfully applied to a variety of tasks.
In real-time scenarios such as inference on autonomous vehicles, the inference speed of the model is critical.
Network quantization is an effective approach to accelerating deep learning models.
In quantized models, both data and model parameters are represented with low precision data types such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;float16&lt;/code&gt;.
The lowered data bandwidth reduces the inference time and memory/storage requirements, as well as the power consumption.
Meanwhile, under proper quantization schemes, we can minimize the accuracy drops of the quantized models.
Therefore, quantized models are of particular interest to researchers and developers, as they make large models suitable to deploy on diverse devices, such as GPUs, CPUs, and mobile devices.&lt;/p&gt;
&lt;p&gt;Previously, quantized operators were usually optimized with handcrafted microkernels for different workloads, or relied on black-box proprietary solutions such as cuDNN and TensorRT.
Writing a high-performance microkernel in assembly can be very challenging and usually requires heavy engineering effort.
Besides, it is difficult to adapt these ad-hoc microkernels to emerging workloads and new devices.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/cuda-quantized/benchmark.svg&quot; alt=&quot;image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Figure 1. Inference time of different models on TVM, TensorRT, and MXNet &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;TVM solves this challenge with a full stack compiler and a machine-learning-based optimizer to automatically generate computing kernels.
TVM can generate efficient kernels via automatic search in a human-designed search space.
In standard workloads such as VGG and ResNet, TVM achieves competitive performance compared with other state-of-the-art frameworks.
In emerging models such as ResNeXt and Deformable ConvNets, the automatic optimization makes it easy for TVM to adapt to these new workloads and achieve a significant performance boost.&lt;/p&gt;
&lt;p&gt;In this post, we show how to use TVM to automatically optimize quantized deep learning models on CUDA.&lt;/p&gt;
&lt;h1 id=&quot;expressing-quantized-cuda-kernels-in-tvm&quot;&gt;Expressing Quantized CUDA Kernels in TVM&lt;/h1&gt;
&lt;h2 id=&quot;leveraging-tensor-intrinsics-via-tensorization&quot;&gt;Leveraging Tensor Intrinsics via Tensorization&lt;/h2&gt;
&lt;p&gt;Many platforms provide architecture-specific instructions for special computation patterns, for example, the SIMD instructions on x86, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;hfma&lt;/code&gt; instructions on CUDA.
These intrinsic instructions are highly optimized for specific devices.
By leveraging hardware intrinsics, we can achieve a significant performance boost for quantized operators.&lt;/p&gt;
&lt;p&gt;Currently, &lt;a href=&quot;https://devblogs.nvidia.com/mixed-precision-programming-cuda-8/&quot;&gt;dp4a&lt;/a&gt; has been extensively used in TVM int8 operators on CUDA.
&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; is a CUDA intrinsic on Compute Capability 6.1 devices.
It is a mixed-precision instruction that provides the efficient computation of the dot product between two 4-element 8-bit integer vectors and accumulates the result in 32-bit format.
Using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt;, we can implement a dot product between 8-bit integer vectors whose number of elements is evenly divisible by four.
With an efficient dot product operator, we can implement high-level operators such as 2d convolution and dense layers as these operators are commonly backed by dot products.&lt;/p&gt;
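&lt;p&gt;As a rough illustration of these semantics, the following pure-Python sketch emulates a &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt;-style step and builds a longer dot product from it. It is an emulation for reading purposes, not CUDA code: the function names are made up, and the wraparound of the 32-bit accumulator is an assumption of this sketch.&lt;/p&gt;

```python
def to_int32(v):
    # Wrap to a signed 32-bit value (assumed accumulator behaviour in this sketch).
    return (v + 2**31) % 2**32 - 2**31


def dp4a(a, b, acc):
    # Dot product of two 4-element int8 vectors, added to a 32-bit accumulator.
    assert len(a) == 4 and len(b) == 4
    return to_int32(acc + sum(x * y for x, y in zip(a, b)))


def dot_int8(a, b):
    # The length must be divisible by four so the vectors split into dp4a-sized chunks.
    assert len(a) == len(b) and len(a) % 4 == 0
    acc = 0
    for i in range(0, len(a), 4):
        acc = dp4a(a[i:i + 4], b[i:i + 4], acc)
    return acc
```

&lt;p&gt;Each loop iteration in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dot_int8&lt;/code&gt; corresponds to one intrinsic call, which is why operators built on dot products benefit from it.&lt;/p&gt;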
&lt;p&gt;To illustrate, in 2d convolution we accumulate along the channel, the width, and the height axis of the kernel.
This is a typical use case of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt;.
TVM uses tensorization to support calling external intrinsics.
We do not need to modify the original computation declaration; we use the schedule primitive &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tensorize&lt;/code&gt; to replace the accumulation with the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; tensor intrinsic.
More details of tensorization can be found in the &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/language/tensorize.html&quot;&gt;tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;h2 id=&quot;data-layout-rearrangement&quot;&gt;Data Layout Rearrangement&lt;/h2&gt;
&lt;p&gt;One of the challenges in tensorization is that we may need to design special computation logic to adapt to the requirement of tensor intrinsics.
Although it is natural to accumulate along the inner axis of the tensor in the dense operator, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d&lt;/code&gt; can be more challenging.
In &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d&lt;/code&gt; we expect to take a slice in the channel dimension as the input of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; because the number of channels is typically a multiple of 4 (otherwise we fall back to the original &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d&lt;/code&gt; in NCHW layout).
Meanwhile, to achieve memory locality, we would like to reduce along the innermost axis first.
Taking these factors into account, we use a custom data layout to address this challenge.&lt;/p&gt;
&lt;p&gt;In CUDA int8 2d convolution, we empirically choose &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCHW4c&lt;/code&gt; as the data layout and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OIHW4o4i&lt;/code&gt; as the weight layout.
The templates can also be easily generalized to &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCHW[x]c&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;OIHW[x]o[x]i&lt;/code&gt;, where x is an arbitrary positive integer divisible by four.
In the data layout we choose, slices of channels are in the packed innermost dimension.
Likewise, we pack slices in both the input and output channel dimensions of the weight so that the output has a consistent data layout with the input, which prevents redundant layout transformations between layers.&lt;/p&gt;
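&lt;p&gt;The packing described above can be sketched in plain Python. This is only an illustration of the index arithmetic behind &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;NCHW4c&lt;/code&gt;; the helper name &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pack_nchw4c&lt;/code&gt; is hypothetical and is not part of TVM.&lt;/p&gt;

```python
# Sketch: repack a flat NCHW buffer into NCHW4c, where each group of four
# consecutive channels becomes the innermost (packed) dimension.
# pack_nchw4c is a hypothetical helper for illustration, not a TVM function.

def pack_nchw4c(data, n, c, h, w, block=4):
    """Return a flat list in NCHW[block]c order built from a flat NCHW list."""
    assert c % block == 0, "channel count must be a multiple of the block size"
    packed = [0] * len(data)
    for ni in range(n):
        for ci in range(c):
            # co: index of the channel block, cb: position inside the block
            co, cb = divmod(ci, block)
            for hi in range(h):
                for wi in range(w):
                    src = ((ni * c + ci) * h + hi) * w + wi
                    dst = (((ni * (c // block) + co) * h + hi) * w + wi) * block + cb
                    packed[dst] = data[src]
    return packed

# One image, 8 channels, 1x2 spatial size; each value encodes channel * 10 + w.
n, c, h, w = 1, 8, 1, 2
data = [ch * 10 + x for ch in range(c) for x in range(w)]
packed = pack_nchw4c(data, n, c, h, w)
print(packed[:4])    # channels 0..3 at spatial position w = 0
print(packed[8:12])  # channels 4..7 at spatial position w = 0
```

&lt;p&gt;After packing, the four channel values needed by one &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; call sit next to each other in memory, which is the property the layout is chosen for.&lt;/p&gt;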
&lt;p&gt;We show the computation of one element of the output of the 2d convolution in Figure 2.
The element in each position of the super dimension (the outer dimension of the blocked layout which contains packed elements) NCHW and OIHW is the packed input and kernel, respectively.
Each column of the packed kernel comes from a different filter.
We calculate the dot product between the packed input and each row in the packed kernel using &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt;, and accumulate the result to the output tensor.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/cuda-quantized/conv2d.png&quot; alt=&quot;image&quot; width=&quot;60%&quot; /&gt;&lt;/p&gt;
&lt;div&gt;
Figure 2. 2D convolution with data layout in NCHW4c and weight layout in OIHW4o4i.
&lt;b&gt;Left&lt;/b&gt;: The input tensor in NCHW4c layout. One moving filter of the kernel is colored in blue. One element of the input and kernel is colored in grey.
&lt;b&gt;Mid&lt;/b&gt;: The packed input and kernel in the grey block.
&lt;b&gt;Right&lt;/b&gt;: The output in NCHW4c layout. Inside the one element depicted, there are four packed elements in channel sub-dimension.
&lt;/div&gt;
&lt;p&gt;&lt;/p&gt;
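&lt;p&gt;As a rough illustration of the arithmetic in Figure 2, the snippet below emulates &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dp4a&lt;/code&gt; in plain Python and applies it to one packed input and one 4x4 packed kernel block. It assumes that row &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o&lt;/code&gt; of the block holds the four input-channel weights for output channel &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;o&lt;/code&gt;; the function names are made up for this sketch and nothing here runs on a GPU.&lt;/p&gt;

```python
# Sketch: emulate the dp4a arithmetic for one packed input element and one
# 4o4i kernel block. Function names are hypothetical, chosen for this example.

def dp4a(a, b, acc):
    """Dot product of two 4-element integer vectors added to an accumulator."""
    assert len(a) == 4 and len(b) == 4
    return acc + sum(x * y for x, y in zip(a, b))

def accumulate_block(packed_input, packed_kernel, out):
    """Update 4 packed output channels; packed_kernel[o] is the row for output o."""
    return [dp4a(packed_input, packed_kernel[o], out[o]) for o in range(4)]

packed_input = [1, -2, 3, 4]   # four input channels at one spatial position
packed_kernel = [              # 4 output channels x 4 input channels
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [2, 0, -1, 0],
]
out = accumulate_block(packed_input, packed_kernel, [0, 0, 0, 10])
print(out)
```

&lt;p&gt;A full convolution repeats this accumulation over every kernel position and every block of input channels.&lt;/p&gt;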
&lt;p&gt;After we have specified the layout of convolution layers, other operators such as &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;add&lt;/code&gt; and activations can automatically adapt to the chosen layout during the &lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/src/relay/pass/alter_op_layout.cc&quot;&gt;AlterOpLayout&lt;/a&gt; pass in Relay.
The layout transformation of the weight can be precomputed offline. Therefore, we can run the whole model in the same layout without extra overhead.&lt;/p&gt;
&lt;h2 id=&quot;designing-search-space-for-automatic-optimization&quot;&gt;Designing Search Space for Automatic Optimization&lt;/h2&gt;
&lt;p&gt;The key to achieving good performance in our quantized operators is to integrate with machine-learning-based automatic optimization. One question is how to design an effective schedule search space.
An effective schedule template means that we can obtain good performance in a reasonable number of iterations in automatic tuning.
Generally speaking, we strive to define a flexible template to cover different configurations in the search space.
On the other hand, we also take advantage of the prior knowledge in performance optimization.
For example, as caching data in the shared memory is a common practice in CUDA programming, we utilize shared memory, but we use machine learning to choose the best tile size.
We also do some manual tiling such as splitting axes by 4 or 16 to facilitate vectorized memory access.&lt;/p&gt;
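&lt;p&gt;To make the idea of a search space concrete, here is a small sketch that enumerates candidate configurations by hand. It does not use the AutoTVM API; the knob names and the rule that the inner tile must be a multiple of 4 are assumptions made for this example, standing in for the kind of prior knowledge mentioned above.&lt;/p&gt;

```python
# Sketch: enumerate a toy schedule search space in plain Python.
# Knob names and the "inner tile divisible by 4" rule are illustrative
# assumptions, not the actual AutoTVM template.
from itertools import product

def split_candidates(extent, parts):
    """All ordered ways to write extent as a product of `parts` positive factors."""
    if parts == 1:
        return [(extent,)]
    result = []
    for f in range(1, extent + 1):
        if extent % f == 0:
            for rest in split_candidates(extent // f, parts - 1):
                result.append((f,) + rest)
    return result

# Prior knowledge: keep only splits whose innermost tile is a multiple of 4.
tile_x = [s for s in split_candidates(16, 2) if s[-1] % 4 == 0]

space = {
    "tile_x": tile_x,
    "unroll": [0, 1],
    "double_buffer": [False, True],
}
configs = [dict(zip(space, choice)) for choice in product(*space.values())]
print(len(configs), configs[0])
```

&lt;p&gt;A tuner then measures or predicts the cost of points in such a space instead of trying every one.&lt;/p&gt;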
&lt;p&gt;In quantized 2d convolution, we design a search space that includes a set of tunable options, such as the tile size, the axes to fuse, and the configurations of loop unrolling and double buffering.
The templates of quantized &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;conv2d&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;dense&lt;/code&gt; on CUDA are registered under the template key &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;int8&lt;/code&gt;.
During automatic tuning, we can create tuning tasks for these quantized operators by setting the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;template_key&lt;/code&gt; argument.
Details of how to launch automatic optimization can be found in the &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/autotvm/tune_relay_cuda.html&quot;&gt;AutoTVM tutorial&lt;/a&gt;.&lt;/p&gt;
&lt;h1 id=&quot;general-workflow&quot;&gt;General Workflow&lt;/h1&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/cuda-quantized/workflow.png&quot; alt=&quot;image&quot; width=&quot;60%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Figure 3. Workflow of running quantized models &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;TVM provides an easy workflow to quantize trained models from other frameworks, automatically optimize operators (with AutoTVM), and deploy to different devices.&lt;/p&gt;
&lt;p&gt;First, we use the Relay frontend to import existing models. Here we use an MXNet model with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;(1, 3, 224, 224)&lt;/code&gt; input shape as an example.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;mxnet&lt;/span&gt;
&lt;span class=&quot;kn&quot;&gt;from&lt;/span&gt; &lt;span class=&quot;nn&quot;&gt;tvm&lt;/span&gt; &lt;span class=&quot;kn&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aux_params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;mxnet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;load_checkpoint&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;model_path&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;epoch&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;frontend&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;from_mxnet&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;sym&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;shape&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;'data'&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;:&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)},&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;arg_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;arg_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;aux_params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;aux_params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Next, we use the Relay quantization API to convert it to a quantized model.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;quantize&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;params&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;Then, we use AutoTVM to extract tuning tasks for the operators in the model and perform automatic optimization. The &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/autotvm/tune_relay_cuda.html&quot;&gt;AutoTVM tutorial&lt;/a&gt; provides an example for this.&lt;/p&gt;
&lt;p&gt;Finally, we build the model and run inference in the quantized mode.&lt;/p&gt;
&lt;div class=&quot;language-python highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;&lt;span class=&quot;k&quot;&gt;with&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build_config&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;opt_level&lt;/span&gt;&lt;span class=&quot;o&quot;&gt;=&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;):&lt;/span&gt;
    &lt;span class=&quot;n&quot;&gt;graph&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;lib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;params&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;relay&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;build&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;net&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;target&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;The result of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;relay.build&lt;/code&gt; is a deployable library.
We can either run inference &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/frontend/from_mxnet.html#execute-the-portable-graph-on-tvm&quot;&gt;on the GPU&lt;/a&gt; directly or deploy &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/frontend/deploy_model_on_rasp.html#deploy-the-model-remotely-by-rpc&quot;&gt;on the remote devices&lt;/a&gt; via RPC.&lt;/p&gt;
&lt;h1 id=&quot;benchmark&quot;&gt;Benchmark&lt;/h1&gt;
&lt;p&gt;To verify the performance of the quantized operators in TVM, we benchmark the performance of several popular network models including VGG-19, ResNet-50 and Inception V3.
We also benchmark on DRN-C-26, ResNeXt-50, and DCN-ResNet-101 from &lt;a href=&quot;https://github.com/msracver/Deformable-ConvNets&quot;&gt;Deformable ConvNets&lt;/a&gt; to show the performance of emerging models, which contain less conventional operators such as dilated convolutions, group convolutions and deformable convolutions.
We choose NVIDIA TensorRT as our baseline.
The result of MXNet 1.4 + cuDNN 7.3 in float32 mode is reported to show the speedup from quantization.
The experiments are conducted on NVIDIA GTX 1080.
We report the inference time per image when running in batch size = 1 and 16.&lt;/p&gt;
&lt;p&gt;As shown in Figure 1, TVM achieves up to 8x speedup using quantization.
In standard CNN models such as VGG and ResNet, TVM achieves parity with the state-of-the-art results from TensorRT.&lt;/p&gt;
&lt;p&gt;When benchmarking emerging models, TVM achieves impressive results.
We obtain significant performance gains on ResNeXt and DCN-ResNet-101.
Results of DCN-ResNet-101 of TensorRT are not available because there is no official implementation of the deformable convolution.
We show that automatic optimization in TVM makes it easy and flexible to support and optimize emerging workloads.&lt;/p&gt;
&lt;h1 id=&quot;show-me-the-code&quot;&gt;Show Me the Code&lt;/h1&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/vinx13/tvm-cuda-int8-benchmark&quot;&gt;Benchmark&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/conv2d_int8.py&quot;&gt;CUDA int8 conv2d&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/group_conv2d_nchw.py&quot;&gt;CUDA int8 group conv2d&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/dense.py&quot;&gt;CUDA int8 dense&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/topi/python/topi/cuda/tensor_intrin.py&quot;&gt;Tensor intrinsics declaration&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h1 id=&quot;bio--acknowledgement&quot;&gt;Bio &amp;amp; Acknowledgement&lt;/h1&gt;
&lt;p&gt;&lt;a href=&quot;https://wuwei.io/&quot;&gt;Wuwei Lin&lt;/a&gt; is an undergraduate student at SJTU. He is currently an intern at TuSimple. The author thanks &lt;a href=&quot;https://homes.cs.washington.edu/~tqchen/&quot;&gt;Tianqi Chen&lt;/a&gt; and &lt;a href=&quot;https://homes.cs.washington.edu/~eqy/&quot;&gt;Eddie Yan&lt;/a&gt; for their reviews.&lt;/p&gt;
</content>
</entry>
<entry>
<title>TVM Deep Learning Compiler Joins Apache Software Foundation</title>
<link href="https://tvm.apache.org/2019/03/18/tvm-apache-announcement"/>
<updated>2019-03-18T00:00:00-04:00</updated>
<id>https://tvm.apache.org/2019/03/18/tvm-apache-announcement</id>
<content type="html">&lt;p&gt;There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms – such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) – requires significant manual effort.&lt;/p&gt;
&lt;p&gt;TVM is an open source deep learning compiler stack that closes the gap between productivity-focused deep learning frameworks and performance- or efficiency-oriented hardware backends. Today, we are glad to announce that the TVM community has decided to move to the Apache Incubator and become an Apache (incubating) project.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/main/tvm-stack.png&quot; alt=&quot;image&quot; width=&quot;70%&quot; /&gt;&lt;/p&gt;
&lt;p&gt;The TVM stack began as a research project at the &lt;a href=&quot;https://sampl.cs.washington.edu/&quot;&gt;SAMPL group&lt;/a&gt; of the Paul G. Allen School of Computer Science &amp;amp; Engineering, University of Washington. The project uses the loop-level IR and several optimizations from the &lt;a href=&quot;http://halide-lang.org/&quot;&gt;Halide project&lt;/a&gt;, in addition to &lt;a href=&quot;https://tvm.apache.org/about&quot;&gt;a full deep learning compiler stack&lt;/a&gt; to support machine learning workloads for diverse hardware backends.&lt;/p&gt;
&lt;p&gt;Since its introduction, the project has been driven by an open source community involving multiple industry and academic institutions. Currently, the TVM stack includes a high-level differentiable programming IR for high-level optimization, a machine-learning-driven program optimizer, and VTA – a fully open sourced deep learning accelerator. The community brings innovations from machine learning, compiler systems, programming languages, and computer architecture to build a full-stack open source deep learning compiler system. The project has been used in production at &lt;a href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;several major companies&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;Besides the technical innovations, the community adopts an open, welcoming and neutral policy. The project is run by committers who are elected purely based on the merit of their contributions to the project. Besides the contributors from UW SAMPL, the community now has nearly 200 contributors from Amazon Web Services (AWS), Qualcomm, Facebook, Google, Huawei, AMD, Microsoft, Cornell University, University of California, Berkeley, and more. The community successfully organized its first developer conference last December, which attracted more than 180 attendees from all around the world. Moving forward to Apache, we will continue to exercise this principle in an effort to bring deep learning compilation to everyone.&lt;/p&gt;
&lt;p&gt;We would like to take this chance to thank the Allen School for supporting the SAMPL team that gave birth to the TVM project. We would also like to thank the Halide project which provided the basis for TVM’s loop-level IR and initial code generation. We would like to thank our Apache incubator mentors for introducing the project to Apache and providing useful guidance. Finally, we would like to thank the TVM community and all of the organizations, as listed above, that supported the developers of TVM.&lt;/p&gt;
&lt;p&gt;See also the &lt;a href=&quot;https://news.cs.washington.edu/2019/03/18/allen-schools-tvm-deep-learning-compiler-framework-transitions-to-apache/&quot;&gt;Allen School news about the transition here&lt;/a&gt;, &lt;a href=&quot;https://sampl.cs.washington.edu/tvmconf/#about-tvmconf&quot;&gt;TVM conference program slides and recordings&lt;/a&gt;, and &lt;a href=&quot;https://tvm.apache.org/docs//contribute/community.html&quot;&gt;our community guideline here&lt;/a&gt;. Follow us on Twitter: &lt;a href=&quot;https://twitter.com/ApacheTVM&quot;&gt;@ApacheTVM&lt;/a&gt;.&lt;/p&gt;
</content>
</entry>
<entry>
<title>TVM Golang Runtime for Deep Learning Deployment</title>
<link href="https://tvm.apache.org/2019/01/19/Golang"/>
<updated>2019-01-19T00:00:00-05:00</updated>
<id>https://tvm.apache.org/2019/01/19/Golang</id>
<content type="html">&lt;h2 id=&quot;introduction&quot;&gt;Introduction&lt;/h2&gt;
&lt;p&gt;TVM is an open deep learning compiler stack to compile various deep learning models from different
frameworks to CPU, GPU or specialized accelerators. TVM supports model compilation from a wide range
of front ends like TensorFlow, ONNX, Keras, MXNet, Darknet, CoreML and Caffe2. TVM compiled modules
can be deployed on backends like LLVM (JavaScript or WASM, AMD GPU, ARM or x86), NVIDIA GPU (CUDA),
OpenCL and Metal.&lt;/p&gt;
&lt;p&gt;TVM supports runtime bindings for programming languages like JavaScript, Java, Python, C++… and now Golang.
With a wide range of frontend, backend and runtime bindings, TVM enables developers to integrate and
deploy deep learning models from a variety of frameworks to a choice of hardware via many programming languages.&lt;/p&gt;
&lt;p&gt;The TVM import and compilation process generates a graph JSON, a module, and a set of params. Any application that
integrates the TVM runtime can load these compiled modules and perform inference. Detailed tutorials on module
import and compilation using TVM can be found at &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/&quot;&gt;tutorials&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;TVM now supports deploying compiled modules through Golang. Golang applications can make use of this
to deploy deep learning models through TVM. This post introduces the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; package,
describes the package build process, and walks through a sample application that uses &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; to load a compiled module and perform inference.&lt;/p&gt;
&lt;h2 id=&quot;package&quot;&gt;Package&lt;/h2&gt;
&lt;p&gt;The golang package &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; is built on top of TVM’s C runtime interface. The API in this package
abstracts the native C types and provides Golang compatible types. The package source can be found
at &lt;a href=&quot;https://github.com/dmlc/tvm/tree/master/golang&quot;&gt;gotvm&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;This package leverages golang’s interface, slices, function closures and implicitly handles the
necessary conversions across API calls.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/golang/TVM-Golang-Blog.png&quot; alt=&quot;image&quot; width=&quot;60%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Golang Interface over TVM Runtime &lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;h2 id=&quot;how-to&quot;&gt;How to&lt;/h2&gt;
&lt;p&gt;As shown in the diagram below, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; enables Golang applications to integrate deep learning models
from various frameworks without the hassle of understanding each framework's interface API.
Developers can make use of TVM to import and compile deep learning models and generate TVM artifacts.
The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; package provides a Golang-friendly API to load a model, configure it, feed input and get output.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/golang/TVM-Golang-Flow.png&quot; alt=&quot;image&quot; width=&quot;100%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Import, Compile, Integrate and Deploy&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;TVM &lt;a href=&quot;https://tvm.apache.org/docs//tutorials/#compile-deep-learning-models&quot;&gt;Compile Deep Learning Models&lt;/a&gt; tutorials
are available to compile models from all frameworks supported by the TVM frontend. This compilation process
generates the artifacts required to integrate and deploy the model on a target.&lt;/p&gt;
&lt;h2 id=&quot;api&quot;&gt;API&lt;/h2&gt;
&lt;p&gt;The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; package provides a handful of datatypes and API functions to initialize, load and run inference
from a Golang application. As with any other Golang package, we only need to import the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; package.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Module : The Module API can be used to load a TVM compiled module into TVM runtime and access any functions.&lt;/li&gt;
&lt;li&gt;Value : The Value API provides helper functions to set arguments or get return values in golang types like basic types or slices.&lt;/li&gt;
&lt;li&gt;Function : The Function API is useful for getting handles to functions and invoking them.&lt;/li&gt;
&lt;li&gt;Array : The Array API is useful for setting and getting Tensor data via golang slice.&lt;/li&gt;
&lt;li&gt;Context : The Context API contains helper functions to build backend context handles.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;example&quot;&gt;Example&lt;/h2&gt;
&lt;p&gt;A simple example with inline documentation of loading a compiled module and performing inference is shown below.
For simplicity the error handling is ignored here, but is important in real applications.&lt;/p&gt;
&lt;div class=&quot;language-cpp highlighter-rouge&quot;&gt;&lt;div class=&quot;highlight&quot;&gt;&lt;pre class=&quot;highlight&quot;&gt;&lt;code&gt;
&lt;span class=&quot;n&quot;&gt;package&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;main&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Import compiled gotvm package.&lt;/span&gt;
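&lt;span class=&quot;n&quot;&gt;import&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;fmt&quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;io/ioutil&quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;math/rand&quot;&lt;/span&gt;
&lt;span class=&quot;s&quot;&gt;&quot;./gotvm&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;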
&lt;span class=&quot;c1&quot;&gt;// Some constants for TVM compiled model paths.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// modLib : Is the compiled library exported out of compilation.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// modJSON : TVM graph JSON.&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// modParams : Exported params out of TVM compilation process.&lt;/span&gt;
&lt;span class=&quot;k&quot;&gt;const&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modLib&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;./libdeploy.so&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modJSON&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;./deploy.json&quot;&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modParams&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;./deploy.params&quot;&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// main&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;func&lt;/span&gt; &lt;span class=&quot;nf&quot;&gt;main&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Some util API to query underlying TVM and DLPack version information.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fmt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;TVM Version : v%v&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;TVMVersion&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;fmt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Printf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;DLPACK Version: v%v&lt;/span&gt;&lt;span class=&quot;se&quot;&gt;\n\n&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;DLPackVersion&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Import tvm module (so).&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;modp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;LoadModuleFromFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;modLib&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Load module on tvm runtime - call tvm.graph_runtime.create&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// with module and graph JSON.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ioutil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReadFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;modJSON&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;jsonStr&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;string&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetGlobalFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;tvm.graph_runtime.create&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;graphrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Invoke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;jsonStr&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;modp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;KDLCPU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;graphmod&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphrt&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AsModule&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Allocate input &amp;amp; output arrays and fill some data for input.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tshapeIn&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;tshapeOut&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;[]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;1&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;1001&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;inX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tshapeIn&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;s&quot;&gt;&quot;float32&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CPU&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;gotvm&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Empty&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;tshapeOut&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;inSlice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;make&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;([]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;224&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;*&lt;/span&gt; &lt;span class=&quot;mi&quot;&gt;3&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;))&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Seed&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;10&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Shuffle&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;len&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inSlice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;func&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;j&lt;/span&gt; &lt;span class=&quot;kt&quot;&gt;int&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;{&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inSlice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;i&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;],&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;inSlice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;[&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;j&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;]&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(),&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;rand&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt; &lt;span class=&quot;p&quot;&gt;})&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;inX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;CopyFrom&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;inSlice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Load params&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;ioutil&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;ReadFile&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;modParams&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;load_params&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Invoke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;bytes&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Set module input&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;set_input&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Invoke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;input&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;inX&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Run or Execute the graph&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;run&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Invoke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Get output from runtime.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;graphmod&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;GetFunction&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;s&quot;&gt;&quot;get_output&quot;&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;funp&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;Invoke&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;int64&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;(&lt;/span&gt;&lt;span class=&quot;mi&quot;&gt;0&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;),&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// Access output tensor data.&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;outIntf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;,&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;_&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;out&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;AsSlice&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;()&lt;/span&gt;
&lt;span class=&quot;n&quot;&gt;outSlice&lt;/span&gt; &lt;span class=&quot;o&quot;&gt;:=&lt;/span&gt; &lt;span class=&quot;n&quot;&gt;outIntf&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;.([]&lt;/span&gt;&lt;span class=&quot;n&quot;&gt;float32&lt;/span&gt;&lt;span class=&quot;p&quot;&gt;)&lt;/span&gt;
&lt;span class=&quot;c1&quot;&gt;// outSlice here holds flattened output data as a golang slice.&lt;/span&gt;
&lt;span class=&quot;p&quot;&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;&lt;/div&gt;
&lt;p&gt;&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;gotvm&lt;/code&gt; extends the TVM packed function system to support golang function closures as packed functions.
&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/golang/sample&quot;&gt;Examples&lt;/a&gt; are available that register a golang
closure as a TVM packed function and invoke it across programming language barriers.&lt;/p&gt;
&lt;h2 id=&quot;show-me-the-code&quot;&gt;Show me the code&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/golang/src&quot;&gt;Package Source&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href=&quot;https://github.com/apache/incubator-tvm/blob/main/golang/sample&quot;&gt;Examples&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2 id=&quot;references&quot;&gt;References&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;[1] &lt;a href=&quot;https://golang.org&quot;&gt;Go Programming Lang&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[2] &lt;a href=&quot;https://blog.golang.org/godoc-documenting-go-code&quot;&gt;Go Documentation Guidelines&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[3] &lt;a href=&quot;https://golang.org/pkg/testing&quot;&gt;Go Testcase Framework&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[4] &lt;a href=&quot;https://golang.org/cmd/cgo&quot;&gt;Go CFFI&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[5] &lt;a href=&quot;https://blog.learngoprogramming.com/golang-variadic-funcs-how-to-patterns-369408f19085&quot;&gt;Go Variadic Functions&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[6] &lt;a href=&quot;https://github.com/jdeng/gomxnet&quot;&gt;CFFI Ref&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;[7] &lt;a href=&quot;https://golang.org/pkg/runtime/#SetFinalizer&quot;&gt;Go Finalizers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
</content>
</entry>
<entry>
<title>Automating Generation of Low Precision Deep Learning Operators</title>
<link href="https://tvm.apache.org/2018/12/18/lowprecision-conv"/>
<updated>2018-12-18T00:00:00-05:00</updated>
<id>https://tvm.apache.org/2018/12/18/lowprecision-conv</id>
<content type="html">&lt;p&gt;As deep learning models grow larger and more complex, deploying them on low powered phone and IoT
devices becomes challenging because of their limited compute and energy budgets. A recent trend
in deep learning is the use of extremely quantized models that operate on inputs and
weights of a few bits, with networks like XNOR-Net, DoReFa-Net, and HWGQ-Net making steady
progress improving accuracy.&lt;/p&gt;
&lt;p&gt;An example of a low precision graph snippet is below. The low precision convolution takes in
quantized data and bitpacks it into the proper data layout for an efficient bitserial convolution.
The output is in a higher precision, and traditional deep learning layers such as batch normalization and ReLU are applied to it before it is re-quantized and sent through another low precision operator.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/workflow.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Low precision convolution pipeline.&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
&lt;p&gt;Theoretically, low precision operators use fewer operations than
floating point operators, leading many to believe they can achieve tremendous speedups.
However, deep learning frameworks leverage decades of engineering work through low level
BLAS and LAPACK libraries that are incredibly well optimized, and CPUs include intrinsic
instructions to accelerate these tasks. In practice, it is not simple to develop low-level
operators such as convolutions that are competitive with 8-bit quantized or even floating
point operators.
In this post we introduce our approach to automatically generating optimized
low precision convolutions for CPUs. We declare our low precision operators so that they compute
on efficiently stored low precision inputs, and define a schedule that describes a search space
of implementation parameters. We rely on AutoTVM to quickly search the space and find optimized
parameters for the particular convolution, precision, and backend.&lt;/p&gt;
&lt;h2 id=&quot;bitserial-computation-background&quot;&gt;Bitserial Computation Background&lt;/h2&gt;
&lt;p&gt;The core of low precision models is the bitserial dot product that enables convolution and
dense operators to be computed using only bitwise operations and popcount.
Typically, a dot product is computed by element-wise multiplication of two vectors followed by
summing all the elements, like the simple example below. If all the data is binary, the input
vectors can each be packed into a single integer, and the dot product can be computed by bitwise-anding
the packed inputs and counting the number of 1’s in the result using popcount.
Note: Depending on how the input data is quantized, bitwise-xnor may be used instead of bitwise-and.&lt;/p&gt;
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/binary-dotproduct.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;
&lt;center&gt; Binary dot product.&lt;/center&gt;
&lt;p&gt;&lt;/p&gt;
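&lt;p&gt;The binary dot product described above can be sketched in a few lines of plain Python. This is only an illustration of the arithmetic, not TVM code: the helper names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;pack_bits&lt;/code&gt;, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;popcount&lt;/code&gt;, and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;binary_dot&lt;/code&gt; are hypothetical, and the sketch assumes the inputs are quantized to 0/1 values so that bitwise-and is the right operation.&lt;/p&gt;

```python
from operator import and_  # function form of Python's bitwise AND operator


def pack_bits(bits):
    """Pack a list of 0/1 values into one integer, element i in bit position i."""
    packed = 0
    for position, bit in enumerate(bits):
        packed += bit * (2 ** position)
    return packed


def popcount(value):
    """Count the 1 bits of a non-negative integer."""
    return bin(value).count("1")


def binary_dot(a_bits, b_bits):
    """Dot product of two 0/1 vectors using only bitwise AND and popcount."""
    return popcount(and_(pack_bits(a_bits), pack_bits(b_bits)))


a = [1, 0, 1, 1, 0, 1]
b = [1, 1, 0, 1, 0, 1]

# Reference result: multiply element-wise, then sum.
reference = sum(x * y for x, y in zip(a, b))
print(binary_dot(a, b), reference)
```

&lt;p&gt;Both values printed by the sketch agree, because a product of two bits is 1 exactly when both bits are 1, which is what bitwise-and followed by popcount counts.&lt;/p&gt;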
&lt;p&gt;Arbitrary precision dot products can be computed in this fashion by first separating input data
into bitplanes. Once in this representation we can compute the dot product by summing weighted binary
dot products between the bitplanes of A and B. The number of binary dot products grows with the
product of A and B’s precision, so this method is only practical for very low precision data.&lt;/p&gt;
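&lt;p&gt;As a rough sketch of the bitplane decomposition, the plain Python below computes a dot product of small unsigned integers from binary dot products only. It is an illustration under stated assumptions rather than the operator TVM generates: the names &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bitplane&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;bitserial_dot&lt;/code&gt; are hypothetical, the inputs are assumed to be unsigned, and the binary dot product between bitplane i of A and bitplane j of B is assumed to be weighted by 2 to the power (i + j).&lt;/p&gt;

```python
from operator import and_  # function form of Python's bitwise AND operator


def popcount(value):
    """Count the 1 bits of a non-negative integer."""
    return bin(value).count("1")


def bitplane(values, bit):
    """Pack bit number `bit` of every element into one integer (element i in bit i)."""
    packed = 0
    for position, value in enumerate(values):
        packed += ((value // 2 ** bit) % 2) * (2 ** position)
    return packed


def bitserial_dot(a, b, a_bits, b_bits):
    """Dot product of unsigned vectors whose elements fit in a_bits and b_bits bits."""
    total = 0
    for i in range(a_bits):
        plane_a = bitplane(a, i)
        for j in range(b_bits):
            plane_b = bitplane(b, j)
            # One binary dot product per pair of bitplanes, weighted by 2^(i + j).
            total += 2 ** (i + j) * popcount(and_(plane_a, plane_b))
    return total


a = [3, 1, 2, 0]  # 2-bit values
b = [2, 3, 1, 1]  # 2-bit values

reference = sum(x * y for x, y in zip(a, b))
print(bitserial_dot(a, b, 2, 2), reference)
```

&lt;p&gt;The nested loops run a_bits times b_bits binary dot products, which makes the cost growth with precision explicit.&lt;/p&gt;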
&lt;p style=&quot;text-align: center&quot;&gt;&lt;img src=&quot;/images/low-precision/bitserial-dotproduct.png&quot; alt=&quot;image&quot; width=&quot;50%&quot; /&gt;&lt;/p&gt;