| |
| .. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY |
| .. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE |
| .. CHANGES, EDIT THE SOURCE PYTHON FILE: |
| .. "tutorial/tensor_expr_get_started.py" |
| |
| .. only:: html |
| |
| .. note:: |
| :class: sphx-glr-download-link-note |
| |
| This tutorial can be used interactively with Google Colab! You can also click |
| :ref:`here <sphx_glr_download_tutorial_tensor_expr_get_started.py>` to run the Jupyter notebook locally. |
| |
| .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg |
| :align: center |
| :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/4459ebf5b03d332f7f380abdaef81c05/tensor_expr_get_started.ipynb |
| :width: 300px |
| |
| .. rst-class:: sphx-glr-example-title |
| |
| .. _sphx_glr_tutorial_tensor_expr_get_started.py: |
| |
| |
| .. _tutorial-tensor-expr-get-started: |
| |
| Working with Operators Using Tensor Expression |
| ============================================== |
| **Author**: `Tianqi Chen <https://tqchen.github.io>`_ |
| |
| In this tutorial we will turn our attention to how TVM works with Tensor |
| Expression (TE) to define tensor computations and apply loop optimizations. TE |
describes tensor computations in a pure functional language (that is, each
expression has no side effects). When viewed in the context of TVM as a whole,
Relay describes a computation as a set of operators, and each of these
operators can be represented as a TE expression, where each TE expression takes
input tensors and produces an output tensor.
| |
| This is an introductory tutorial to the Tensor Expression language in TVM. TVM |
| uses a domain specific tensor expression for efficient kernel construction. We |
| will demonstrate the basic workflow with two examples of using the tensor expression |
| language. The first example introduces TE and scheduling with vector |
| addition. The second expands on these concepts with a step-by-step optimization |
| of a matrix multiplication with TE. This matrix multiplication example will |
| serve as the comparative basis for future tutorials covering more advanced |
| features of TVM. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 44-50 |
| |
| Example 1: Writing and Scheduling Vector Addition in TE for CPU |
| --------------------------------------------------------------- |
| |
| Let's look at an example in Python in which we will implement a TE for |
| vector addition, followed by a schedule targeted towards a CPU. |
| We begin by initializing a TVM environment. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 50-56 |
| |
| .. code-block:: default |
| |
| |
| import tvm |
| import tvm.testing |
| from tvm import te |
| import numpy as np |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 57-63 |
| |
You will get better performance if you can identify the CPU you are targeting
and specify it. If you're using LLVM, you can get the CPU type from the
command ``llc --version``, and you can check ``/proc/cpuinfo`` for additional
extensions that your processor might support. For example, you can use the
target ``llvm -mcpu=skylake-avx512`` for CPUs with AVX-512 instructions.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 63-66 |
| |
| .. code-block:: default |
| |
| |
| tgt = tvm.target.Target(target="llvm", host="llvm") |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 67-81 |
| |
| Describing the Vector Computation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| We describe a vector addition computation. TVM adopts tensor semantics, with |
| each intermediate result represented as a multi-dimensional array. The user |
| needs to describe the computation rule that generates the tensors. We first |
| define a symbolic variable ``n`` to represent the shape. We then define two |
| placeholder Tensors, ``A`` and ``B``, with given shape ``(n,)``. We then |
describe the result tensor ``C``, using a ``compute`` operation. The
``compute`` operation defines a computation whose output conforms to the
specified tensor shape, with the computation to be performed at each position
in the tensor defined by the lambda function. Note that while ``n`` is a
variable, it defines a consistent shape across the ``A``, ``B`` and ``C``
tensors. Remember, no actual computation happens during this phase, as we
are only declaring how the computation should be done.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 81-87 |
| |
| .. code-block:: default |
| |
| |
| n = te.var("n") |
| A = te.placeholder((n,), name="A") |
| B = te.placeholder((n,), name="B") |
| C = te.compute(A.shape, lambda i: A[i] + B[i], name="C") |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 88-94 |
| |
| .. admonition:: Lambda Functions |
| |
| The second argument to the ``te.compute`` method is the function that |
| performs the computation. In this example, we're using an anonymous function, |
| also known as a ``lambda`` function, to define the computation, in this case |
| addition on the ``i``\th element of ``A`` and ``B``. |
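
To make the declared rule concrete, here is a plain-NumPy sketch (not TVM
code; the helper name ``naive_compute`` is our own) of what the declared
computation rule eventually does once compiled and run. The lambda plays the
same role as the element-wise rule below.

.. code-block:: python

    import numpy as np

    def naive_compute(shape, f):
        # Toy stand-in for te.compute: eagerly evaluate the element-wise rule.
        out = np.empty(shape)
        for i in range(shape[0]):
            out[i] = f(i)
        return out

    a = np.array([1.0, 2.0, 3.0])
    b = np.array([10.0, 20.0, 30.0])
    c = naive_compute(a.shape, lambda i: a[i] + b[i])
    # c now holds [11., 22., 33.]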
| |
| .. GENERATED FROM PYTHON SOURCE LINES 96-118 |
| |
| Create a Default Schedule for the Computation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
While the above lines describe the computation rule, we can compute ``C`` in
many different ways to fit different devices. For a tensor with multiple
axes, you can choose which axis to iterate over first, or computations can be
split across different threads. TVM requires the user to provide a schedule,
which is a description of how the computation should be performed.
| Scheduling operations within TE can change loop orders, split computations |
| across different threads, and group blocks of data together, amongst other |
| operations. An important concept behind schedules is that they only describe |
| how the computation is performed, so different schedules for the same TE will |
| produce the same result. |
| |
TVM allows you to create a naive schedule that will compute ``C`` by
iterating in row-major order:
| |
| .. code-block:: c |
| |
| for (int i = 0; i < n; ++i) { |
| C[i] = A[i] + B[i]; |
| } |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 118-121 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 122-133 |
| |
| Compile and Evaluate the Default Schedule |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| With the TE expression and a schedule, we can produce runnable code for our |
| target language and architecture, in this case LLVM and a CPU. We provide |
| TVM with the schedule, a list of the TE expressions that are in the schedule, |
the target and host, and the name of the function we are producing. The
result is a type-erased function that can be called directly from Python.
| |
In the following line, we use ``tvm.build`` to create a function. The build
function takes the schedule, the desired signature of the function (including
the inputs and outputs), as well as the target language we want to compile to.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 133-136 |
| |
| .. code-block:: default |
| |
| |
| fadd = tvm.build(s, [A, B, C], tgt, name="myadd") |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 137-145 |
| |
Let's run the function, and compare the output to the same computation in
numpy. The compiled TVM function exposes a concise C API that can be invoked
from any language. We begin by creating a device (a CPU in this example)
that TVM can compile the schedule to; here, that is an LLVM CPU target. We
can then initialize the tensors on our device and perform the custom addition
operation. To verify that the computation is correct, we can compare the
output tensor ``c`` to the same computation performed by numpy.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 145-155 |
| |
| .. code-block:: default |
| |
| |
| dev = tvm.device(tgt.kind.name, 0) |
| |
| n = 1024 |
| a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev) |
| b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev) |
| c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev) |
| fadd(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 156-158 |
| |
To compare the speed of this version against numpy, create a helper function
to profile the TVM-generated code.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 158-190 |
| |
| .. code-block:: default |
| |
| import timeit |
| |
| np_repeat = 100 |
| np_running_time = timeit.timeit( |
| setup="import numpy\n" |
| "n = 32768\n" |
| 'dtype = "float32"\n' |
| "a = numpy.random.rand(n, 1).astype(dtype)\n" |
| "b = numpy.random.rand(n, 1).astype(dtype)\n", |
| stmt="answer = a + b", |
| number=np_repeat, |
| ) |
| print("Numpy running time: %f" % (np_running_time / np_repeat)) |
| |
| |
| def evaluate_addition(func, target, optimization, log): |
| dev = tvm.device(target.kind.name, 0) |
| n = 32768 |
| a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev) |
| b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev) |
| c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| mean_time = evaluator(a, b, c).mean |
| print("%s: %f" % (optimization, mean_time)) |
| |
| log.append((optimization, mean_time)) |
| |
| |
| log = [("numpy", np_running_time / np_repeat)] |
| evaluate_addition(fadd, tgt, "naive", log=log) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Numpy running time: 0.000007 |
| naive: 0.000008 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 191-203 |
| |
| Updating the Schedule to Use Parallelism |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Now that we've illustrated the fundamentals of TE, let's go deeper into what |
| schedules do, and how they can be used to optimize tensor expressions for |
| different architectures. A schedule is a series of steps that are applied to |
| an expression to transform it in a number of different ways. When a schedule |
| is applied to an expression in TE, the inputs and outputs remain the same, |
| but when compiled the implementation of the expression can change. This |
| tensor addition, in the default schedule, is run serially but is easy to |
| parallelize across all of the processor threads. We can apply the parallel |
| schedule operation to our computation. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 203-206 |
| |
| .. code-block:: default |
| |
| |
| s[C].parallel(C.op.axis[0]) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 207-212 |
| |
| The ``tvm.lower`` command will generate the Intermediate Representation (IR) |
| of the TE, with the corresponding schedule. By lowering the expression as we |
| apply different schedule operations, we can see the effect of scheduling on |
| the ordering of the computation. We use the flag ``simple_mode=True`` to |
| return a readable C-style statement. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 212-215 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.handle, B: T.handle, C: T.handle): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| n = T.int32() |
| A_1 = T.match_buffer(A, (n,), strides=("stride",), buffer_type="auto") |
| B_1 = T.match_buffer(B, (n,), strides=("stride",), buffer_type="auto") |
| C_1 = T.match_buffer(C, (n,), strides=("stride",), buffer_type="auto") |
| for i in T.parallel(n): |
| C_2 = T.Buffer((C_1.strides[0] * n,), data=C_1.data, buffer_type="auto") |
| A_2 = T.Buffer((A_1.strides[0] * n,), data=A_1.data, buffer_type="auto") |
| B_2 = T.Buffer((B_1.strides[0] * n,), data=B_1.data, buffer_type="auto") |
| C_2[i * C_1.strides[0]] = A_2[i * A_1.strides[0]] + B_2[i * B_1.strides[0]] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 216-218 |
| |
| It's now possible for TVM to run these blocks on independent threads. Let's |
| compile and run this new schedule with the parallel operation applied: |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 218-226 |
| |
| .. code-block:: default |
| |
| |
| fadd_parallel = tvm.build(s, [A, B, C], tgt, name="myadd_parallel") |
| fadd_parallel(a, b, c) |
| |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| evaluate_addition(fadd_parallel, tgt, "parallel", log=log) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| parallel: 0.000008 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 227-237 |
| |
| Updating the Schedule to Use Vectorization |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| Modern CPUs also have the ability to perform SIMD operations on floating |
| point values, and we can apply another schedule to our computation expression |
to take advantage of this. Accomplishing this requires multiple steps: first
we split the schedule into inner and outer loops using the split scheduling
primitive. The inner loop can then take advantage of SIMD instructions via
the vectorize scheduling primitive, and the outer loop can be parallelized
using the parallel scheduling primitive. Choose the split factor to match
the number of threads on your CPU.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 237-262 |
| |
| .. code-block:: default |
| |
| |
| # Recreate the schedule, since we modified it with the parallel operation in |
| # the previous example |
| n = te.var("n") |
| A = te.placeholder((n,), name="A") |
| B = te.placeholder((n,), name="B") |
| C = te.compute(A.shape, lambda i: A[i] + B[i], name="C") |
| |
| s = te.create_schedule(C.op) |
| |
| # This factor should be chosen to match the number of threads appropriate for |
| # your CPU. This will vary depending on architecture, but a good rule is |
| # setting this factor to equal the number of available CPU cores. |
| factor = 4 |
| |
| outer, inner = s[C].split(C.op.axis[0], factor=factor) |
| s[C].parallel(outer) |
| s[C].vectorize(inner) |
| |
| fadd_vector = tvm.build(s, [A, B, C], tgt, name="myadd_parallel") |
| |
| evaluate_addition(fadd_vector, tgt, "vector", log=log) |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| vector: 0.000040 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.handle, B: T.handle, C: T.handle): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| n = T.int32() |
| A_1 = T.match_buffer(A, (n,), strides=("stride",), buffer_type="auto") |
| B_1 = T.match_buffer(B, (n,), strides=("stride",), buffer_type="auto") |
| C_1 = T.match_buffer(C, (n,), strides=("stride",), buffer_type="auto") |
| for i_outer in T.parallel((n + 3) // 4): |
| for i_inner_s in range(4): |
| if T.likely(i_outer * 4 + i_inner_s < n): |
| C_2 = T.Buffer((C_1.strides[0] * n,), data=C_1.data, buffer_type="auto") |
| A_2 = T.Buffer((A_1.strides[0] * n,), data=A_1.data, buffer_type="auto") |
| B_2 = T.Buffer((B_1.strides[0] * n,), data=B_1.data, buffer_type="auto") |
| cse_var_1: T.int32 = i_outer * 4 + i_inner_s |
| C_2[cse_var_1 * C_1.strides[0]] = A_2[cse_var_1 * A_1.strides[0]] + B_2[cse_var_1 * B_1.strides[0]] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 263-266 |
| |
| Comparing the Different Schedules |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
We can now compare the different schedules:
| |
| .. GENERATED FROM PYTHON SOURCE LINES 266-276 |
| |
| .. code-block:: default |
| |
| |
| baseline = log[0][1] |
| print("%s\t%s\t%s" % ("Operator".rjust(20), "Timing".rjust(20), "Performance".rjust(20))) |
| for result in log: |
| print( |
| "%s\t%s\t%s" |
| % (result[0].rjust(20), str(result[1]).rjust(20), str(result[1] / baseline).rjust(20)) |
| ) |
| |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Operator Timing Performance |
| numpy 6.762330001492956e-06 1.0 |
| naive 7.801099999999999e-06 1.1536112550374957 |
| parallel 8.3665e-06 1.237221489952854 |
| vector 3.9717599999999994e-05 5.873360216261455 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 277-292 |
| |
| .. admonition:: Code Specialization |
| |
| As you may have noticed, the declarations of ``A``, ``B`` and ``C`` all |
| take the same shape argument, ``n``. TVM will take advantage of this to |
| pass only a single shape argument to the kernel, as you will find in the |
| printed device code. This is one form of specialization. |
| |
   On the host side, TVM will automatically generate code that checks the
   constraints on the parameters, so if you pass arrays with different
   shapes into ``fadd``, an error will be raised.

   We can specialize further. For example, we can write
   :code:`n = tvm.runtime.convert(1024)` instead of :code:`n = te.var("n")`
   in the computation declaration. The generated function will then only
   take vectors with length 1024.
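
The host-side argument checking described above can be sketched in plain
Python. This is only an illustration of the concept (the wrapper
``make_checked_add`` is our own name, not a TVM API), not the code TVM
actually generates.

.. code-block:: python

    import numpy as np

    def make_checked_add(n=None):
        # If n is given, the kernel is specialized to that length (like
        # tvm.runtime.convert(1024)); otherwise the shape stays symbolic
        # (like te.var("n")) and only consistency between arguments is checked.
        def fadd(a, b, c):
            if not (a.shape == b.shape == c.shape):
                raise ValueError("shape mismatch between arguments")
            if n is not None and a.shape != (n,):
                raise ValueError("kernel specialized for length %d" % n)
            c[:] = a + b
        return fadd

    fadd_1024 = make_checked_add(1024)
    a = np.ones(1024, dtype="float32")
    b = np.ones(1024, dtype="float32")
    c = np.zeros(1024, dtype="float32")
    fadd_1024(a, b, c)  # passes the checks; c now holds all 2.0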
| |
| .. GENERATED FROM PYTHON SOURCE LINES 294-297 |
| |
| We've defined, scheduled, and compiled a vector addition operator, which we |
| were then able to execute on the TVM runtime. We can save the operator as a |
| library, which we can then load later using the TVM runtime. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 299-303 |
| |
| Targeting Vector Addition for GPUs (Optional) |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| TVM is capable of targeting multiple architectures. In the next example, we |
| will target compilation of the vector addition to GPUs. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 303-394 |
| |
| .. code-block:: default |
| |
| |
| # If you want to run this code, change ``run_cuda = True`` |
| # Note that by default this example is not run in the docs CI. |
| |
| run_cuda = False |
| if run_cuda: |
        # Change this target to the correct backend for your GPU. For example: cuda (NVIDIA GPUs),
        # rocm (Radeon GPUs), or opencl (OpenCL devices).
| tgt_gpu = tvm.target.Target(target="cuda", host="llvm") |
| |
| # Recreate the schedule |
| n = te.var("n") |
| A = te.placeholder((n,), name="A") |
| B = te.placeholder((n,), name="B") |
| C = te.compute(A.shape, lambda i: A[i] + B[i], name="C") |
| print(type(C)) |
| |
| s = te.create_schedule(C.op) |
| |
| bx, tx = s[C].split(C.op.axis[0], factor=64) |
| |
| ################################################################################ |
| # Finally we must bind the iteration axis bx and tx to threads in the GPU |
| # compute grid. The naive schedule is not valid for GPUs, and these are |
| # specific constructs that allow us to generate code that runs on a GPU. |
| |
| s[C].bind(bx, te.thread_axis("blockIdx.x")) |
| s[C].bind(tx, te.thread_axis("threadIdx.x")) |
| |
| ###################################################################### |
| # Compilation |
| # ----------- |
| # After we have finished specifying the schedule, we can compile it |
| # into a TVM function. By default TVM compiles into a type-erased |
| # function that can be directly called from the python side. |
| # |
| # In the following line, we use tvm.build to create a function. |
| # The build function takes the schedule, the desired signature of the |
| # function (including the inputs and outputs) as well as target language |
| # we want to compile to. |
| # |
| # The result of compilation fadd is a GPU device function (if GPU is |
| # involved) as well as a host wrapper that calls into the GPU |
| # function. fadd is the generated host wrapper function, it contains |
| # a reference to the generated device function internally. |
| |
| fadd = tvm.build(s, [A, B, C], target=tgt_gpu, name="myadd") |
| |
| ################################################################################ |
| # The compiled TVM function exposes a concise C API that can be invoked from |
| # any language. |
| # |
| # We provide a minimal array API in python to aid quick testing and prototyping. |
| # The array API is based on the `DLPack <https://github.com/dmlc/dlpack>`_ standard. |
| # |
| # - We first create a GPU device. |
| # - Then tvm.nd.array copies the data to the GPU. |
| # - ``fadd`` runs the actual computation |
| # - ``numpy()`` copies the GPU array back to the CPU (so we can verify correctness). |
| # |
| # Note that copying the data to and from the memory on the GPU is a required step. |
| |
| dev = tvm.device(tgt_gpu.kind.name, 0) |
| |
| n = 1024 |
| a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev) |
| b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev) |
| c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev) |
| fadd(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| ################################################################################ |
| # Inspect the Generated GPU Code |
| # ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| # You can inspect the generated code in TVM. The result of tvm.build is a TVM |
| # Module. fadd is the host module that contains the host wrapper, it also |
| # contains a device module for the CUDA (GPU) function. |
| # |
| # The following code fetches the device module and prints the content code. |
| |
| if ( |
| tgt_gpu.kind.name == "cuda" |
| or tgt_gpu.kind.name == "rocm" |
| or tgt_gpu.kind.name.startswith("opencl") |
| ): |
| dev_module = fadd.imported_modules[0] |
| print("-----GPU code-----") |
| print(dev_module.get_source()) |
| else: |
| print(fadd.get_source()) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 395-405 |
| |
| Saving and Loading Compiled Modules |
| ----------------------------------- |
| Besides runtime compilation, we can save the compiled modules into a file and |
| load them back later. |
| |
The following code performs these steps:

- It saves the compiled host module into an object file.
- It then saves the device module into a ptx file.
- ``cc.create_shared`` calls a compiler (gcc) to create a shared library.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 405-420 |
| |
| .. code-block:: default |
| |
| |
| from tvm.contrib import cc |
| from tvm.contrib import utils |
| |
| temp = utils.tempdir() |
| fadd.save(temp.relpath("myadd.o")) |
| if tgt.kind.name == "cuda": |
| fadd.imported_modules[0].save(temp.relpath("myadd.ptx")) |
| if tgt.kind.name == "rocm": |
| fadd.imported_modules[0].save(temp.relpath("myadd.hsaco")) |
| if tgt.kind.name.startswith("opencl"): |
| fadd.imported_modules[0].save(temp.relpath("myadd.cl")) |
| cc.create_shared(temp.relpath("myadd.so"), [temp.relpath("myadd.o")]) |
| print(temp.listdir()) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| ['myadd.o', 'myadd.so'] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 421-427 |
| |
| .. admonition:: Module Storage Format |
| |
   The CPU (host) module is directly saved as a shared library (.so). There
   can be multiple customized formats for the device code. In our example,
   the device code is stored in ptx, along with a metadata JSON file. They
   can be loaded and linked separately via import.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 429-434 |
| |
| Load Compiled Module |
| ~~~~~~~~~~~~~~~~~~~~ |
| We can load the compiled module from the file system and run the code. The |
| following code loads the host and device module separately and links them |
| together. We can verify that the newly loaded function works. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 434-451 |
| |
| .. code-block:: default |
| |
| |
| fadd1 = tvm.runtime.load_module(temp.relpath("myadd.so")) |
| if tgt.kind.name == "cuda": |
| fadd1_dev = tvm.runtime.load_module(temp.relpath("myadd.ptx")) |
| fadd1.import_module(fadd1_dev) |
| |
| if tgt.kind.name == "rocm": |
| fadd1_dev = tvm.runtime.load_module(temp.relpath("myadd.hsaco")) |
| fadd1.import_module(fadd1_dev) |
| |
| if tgt.kind.name.startswith("opencl"): |
| fadd1_dev = tvm.runtime.load_module(temp.relpath("myadd.cl")) |
| fadd1.import_module(fadd1_dev) |
| |
| fadd1(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 452-458 |
| |
| Pack Everything into One Library |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
In the above example, we stored the device and host code separately. TVM also
supports exporting everything as one shared library. Under the hood, we pack
the device modules into binary blobs and link them together with the host
code. Currently we support packing of Metal, OpenCL and CUDA modules.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 458-464 |
| |
| .. code-block:: default |
| |
| |
| fadd.export_library(temp.relpath("myadd_pack.so")) |
| fadd2 = tvm.runtime.load_module(temp.relpath("myadd_pack.so")) |
| fadd2(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 465-474 |
| |
| .. admonition:: Runtime API and Thread-Safety |
| |
| The compiled modules of TVM do not depend on the TVM compiler. Instead, |
| they only depend on a minimum runtime library. The TVM runtime library |
| wraps the device drivers and provides thread-safe and device agnostic calls |
| into the compiled functions. |
| |
| This means that you can call the compiled TVM functions from any thread, on |
| any GPUs, provided that you have compiled the code for that GPU. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 476-483 |
| |
| Generate OpenCL Code |
| -------------------- |
TVM provides code generation for multiple backends. We can also generate
OpenCL code, or LLVM code that runs on CPU backends.

The following code block generates OpenCL code, creates arrays on an OpenCL
device, and verifies the correctness of the code.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 483-496 |
| |
| .. code-block:: default |
| |
| |
| if tgt.kind.name.startswith("opencl"): |
| fadd_cl = tvm.build(s, [A, B, C], tgt, name="myadd") |
| print("------opencl code------") |
| print(fadd_cl.imported_modules[0].get_source()) |
| dev = tvm.cl(0) |
| n = 1024 |
| a = tvm.nd.array(np.random.uniform(size=n).astype(A.dtype), dev) |
| b = tvm.nd.array(np.random.uniform(size=n).astype(B.dtype), dev) |
| c = tvm.nd.array(np.zeros(n, dtype=C.dtype), dev) |
| fadd_cl(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy()) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 497-518 |
| |
| .. admonition:: TE Scheduling Primitives |
| |
| TVM includes a number of different scheduling primitives: |
| |
   - split: splits a specified axis into two axes by the defined factor.
   - tile: splits a computation across two axes by the defined factors.
   - fuse: fuses two consecutive axes of one computation.
   - reorder: reorders the axes of a computation into a defined order.
   - bind: binds a computation to a specific thread; useful in GPU
     programming.
   - compute_at: by default, TVM will compute tensors at the outermost level
     of the function, or the root. compute_at specifies that one tensor
     should be computed at the first axis of computation for another
     operator.
   - compute_inline: when marked inline, a computation will be expanded and
     then inserted into the address where the tensor is required.
   - compute_root: moves a computation to the outermost layer, or root, of
     the function. This means that the stage of the computation will be
     fully computed before moving on to the next stage.
| |
| A complete description of these primitives can be found in the |
| :ref:`Schedule Primitives <schedule_primitives>` docs page. |
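
As a rough mental model (plain Python, not the TVM API), ``split`` and
``fuse`` are inverse rewrites of a loop nest; the transformed loops below
visit exactly the same indices as the originals.

.. code-block:: python

    n, factor = 10, 4

    # split: one loop becomes an outer and an inner loop, with a tail guard
    # when n is not divisible by the factor (compare the T.likely(...)
    # condition in the lowered IR earlier in this tutorial).
    split_order = []
    for i_outer in range((n + factor - 1) // factor):
        for i_inner in range(factor):
            i = i_outer * factor + i_inner
            if i < n:  # tail guard
                split_order.append(i)
    assert split_order == list(range(n))

    # fuse is the inverse: two nested loops become one flat loop.
    fused_order = []
    for ij in range(3 * 4):
        fused_order.append((ij // 4, ij % 4))
    assert fused_order == [(i, j) for i in range(3) for j in range(4)]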
| |
| .. GENERATED FROM PYTHON SOURCE LINES 520-545 |
| |
| Example 2: Manually Optimizing Matrix Multiplication with TE |
| ------------------------------------------------------------ |
| |
Now we will consider a second, more advanced example, demonstrating how, with
just 18 lines of Python code, TVM speeds up a common matrix multiplication
operation by 18x.
| |
| **Matrix multiplication is a compute intensive operation. There are |
| two important optimizations for good CPU performance:** |
| |
1. Increase the cache hit rate of memory access. Both complex
   numerical computation and hot-spot memory access can be
   accelerated by a high cache hit rate. This requires us to
   transform the original memory access pattern into a pattern that
   fits the cache policy.

2. SIMD (single instruction, multiple data), also known as the vector
   processing unit. On each cycle, instead of processing a single
   value, SIMD can process a small batch of data. This requires us
   to transform the data access pattern in the loop body into a
   uniform pattern so that the LLVM backend can lower it to SIMD.
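
The cache-blocking idea in point 1 can be sketched in NumPy. This is an
illustration of the access-pattern transformation only (the helper
``blocked_matmul`` and its block size are our own choices), not the TVM
schedule developed below.

.. code-block:: python

    import numpy as np

    def blocked_matmul(a, b, bn=32):
        # Compute a @ b one (bn x bn) output tile at a time, so each tile's
        # working set is small enough to stay resident in cache.
        M, K = a.shape
        _, N = b.shape
        c = np.zeros((M, N), dtype=a.dtype)
        for x in range(0, M, bn):
            for y in range(0, N, bn):
                for k in range(0, K, bn):
                    c[x:x + bn, y:y + bn] += (
                        a[x:x + bn, k:k + bn] @ b[k:k + bn, y:y + bn]
                    )
        return c

    a = np.random.rand(64, 64).astype("float32")
    b = np.random.rand(64, 64).astype("float32")
    c = blocked_matmul(a, b)  # matches a @ b up to float32 rounding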
| |
The techniques used in this tutorial are a subset of the tricks mentioned in
this `repository <https://github.com/flame/how-to-optimize-gemm>`_. Some of
them are applied automatically by TVM's abstractions, while others cannot be
applied automatically due to TVM constraints.
| |
| .. GENERATED FROM PYTHON SOURCE LINES 547-552 |
| |
| Preparation and Performance Baseline |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| We begin by collecting performance data on the `numpy` implementation of |
| matrix multiplication. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 552-601 |
| |
| .. code-block:: default |
| |
| |
| import tvm |
| import tvm.testing |
| from tvm import te |
| import numpy |
| |
| # The size of the matrix |
| # (M, K) x (K, N) |
| # You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL. |
| M = 1024 |
| K = 1024 |
| N = 1024 |
| |
| # The default tensor data type in tvm |
| dtype = "float32" |
| |
    # You will want to adjust the target to match any CPU vector extensions you
    # might have. For example, if you're using the Intel AVX2 (Advanced Vector
    # Extensions) ISA for SIMD, you can get the best performance by changing the
    # following line to ``llvm -mcpu=core-avx2``, or to the specific type of CPU
    # you use. Recall that since you're using llvm, you can get the CPU type from
    # the command ``llc --version``, and you can check ``/proc/cpuinfo`` for
    # additional extensions that your processor might support.
| |
| target = tvm.target.Target(target="llvm", host="llvm") |
| dev = tvm.device(target.kind.name, 0) |
| |
| # Random generated tensor for testing |
| a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev) |
| b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev) |
| |
| # Repeatedly perform a matrix multiplication to get a performance baseline |
| # for the default numpy implementation |
| np_repeat = 100 |
| np_running_time = timeit.timeit( |
| setup="import numpy\n" |
| "M = " + str(M) + "\n" |
| "K = " + str(K) + "\n" |
| "N = " + str(N) + "\n" |
| 'dtype = "float32"\n' |
| "a = numpy.random.rand(M, K).astype(dtype)\n" |
| "b = numpy.random.rand(K, N).astype(dtype)\n", |
| stmt="answer = numpy.dot(a, b)", |
| number=np_repeat, |
| ) |
| print("Numpy running time: %f" % (np_running_time / np_repeat)) |
| |
| answer = numpy.dot(a.numpy(), b.numpy()) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Numpy running time: 0.019337 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 602-606 |
| |
| Now we write a basic matrix multiplication using TVM TE and verify that it |
| produces the same results as the numpy implementation. We also write a |
| function that will help us measure the performance of the schedule |
| optimizations. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 606-640 |
| |
| .. code-block:: default |
| |
| |
| # TVM Matrix Multiplication using TE |
| k = te.reduce_axis((0, K), "k") |
| A = te.placeholder((M, K), name="A") |
| B = te.placeholder((K, N), name="B") |
| C = te.compute((M, N), lambda x, y: te.sum(A[x, k] * B[k, y], axis=k), name="C") |
| |
| # Default schedule |
| s = te.create_schedule(C.op) |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| |
| def evaluate_operation(s, vars, target, name, optimization, log): |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| mean_time = evaluator(a, b, c).mean |
| print("%s: %f" % (optimization, mean_time)) |
| log.append((optimization, mean_time)) |
| |
| |
| log = [] |
| |
| evaluate_operation(s, [A, B, C], target=target, name="mmult", optimization="none", log=log) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| none: 3.441145 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 641-645 |
| |
| Let's take a look at the intermediate representation of the operator and |
| default schedule using the TVM lower function. Note how the implementation is |
| essentially a naive implementation of a matrix multiplication, using three |
| nested loops over the indices of the A and B matrices. |
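Before printing the lowered IR, it can help to see the same computation written directly. The sketch below is plain Python/NumPy (not TVM output) showing the triple loop that the default schedule is equivalent to; the sizes here are illustrative, chosen small so the pure-Python loops run quickly.

```python
import numpy as np

# Hypothetical small sizes so the pure-Python loops finish quickly.
M, K, N = 8, 8, 8
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")

# The default schedule lowers to exactly this structure: two loops over
# the output indices, with an inner loop reducing over the shared k axis.
c = np.zeros((M, N), dtype="float32")
for x in range(M):
    for y in range(N):
        for k in range(K):
            c[x, y] += a[x, k] * b[k, y]

np.testing.assert_allclose(c, a @ b, rtol=1e-5)
```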
| |
| .. GENERATED FROM PYTHON SOURCE LINES 645-648 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for x, y in T.grid(1024, 1024): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[x * 1024 + y] = T.float32(0) |
| for k in range(1024): |
| cse_var_2: T.int32 = x * 1024 |
| cse_var_1: T.int32 = cse_var_2 + y |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1] = C_1[cse_var_1] + A_1[cse_var_2 + k] * B_1[k * 1024 + y] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 649-673 |
| |
| Optimization 1: Blocking |
| ~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| An important trick to enhance the cache hit rate is blocking, where you |
| structure memory accesses such that those inside a block form a small |
| neighborhood with high memory locality. In this tutorial, we pick a block |
| factor of 32, so each block fills a 32 * 32 * sizeof(float), or 4KB, area |
| of memory. This is small relative to a reference L1 cache size of 32 KB. |
| |
| We begin by creating a default schedule for the ``C`` operation, then apply |
| the ``tile`` scheduling primitive to it with the specified block factor. The |
| primitive returns the resulting loop order from outermost to innermost as the |
| vector ``[x_outer, y_outer, x_inner, y_inner]``. We then get the reduction |
| axis for the output of the operation and split it with a factor of 4. This |
| factor doesn't directly impact the blocking optimization we're working on |
| right now, but will be useful later when we apply vectorization. |
| |
| Now that the operation has been blocked, we can reorder the computation to |
| put the reduction operation into the outermost loop of the computation, |
| helping to guarantee that the blocked data remains in cache. This completes |
| the schedule, and we can build and test the performance compared to the naive |
| schedule. |
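To make the blocked loop structure concrete before applying it in TE, here is a plain-Python sketch (an illustration, not TVM output) of the same loop nest, with illustrative sizes small enough to run quickly:

```python
import numpy as np

# Illustrative sizes: a 64x64 problem, block factor 16, split factor 4.
M = K = N = 64
bn, kfactor = 16, 4
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
c = np.zeros((M, N), dtype="float32")

# Loop nest mirroring s[C].reorder(xo, yo, ko, ki, xi, yi): the reduction
# axes sit outside the bn x bn output block, so the block of C being
# accumulated stays resident in cache across the whole reduction.
for xo in range(M // bn):
    for yo in range(N // bn):
        for ko in range(K // kfactor):
            for ki in range(kfactor):
                k = ko * kfactor + ki
                for xi in range(bn):
                    for yi in range(bn):
                        c[xo * bn + xi, yo * bn + yi] += (
                            a[xo * bn + xi, k] * b[k, yo * bn + yi]
                        )

np.testing.assert_allclose(c, a @ b, rtol=1e-4)
```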
| |
| .. GENERATED FROM PYTHON SOURCE LINES 673-686 |
| |
| .. code-block:: default |
| |
| |
| bn = 32 |
| |
| # Blocking by loop tiling |
| xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (k,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(k, factor=4) |
| |
| # Hoist reduction domain outside the blocking loop |
| s[C].reorder(xo, yo, ko, ki, xi, yi) |
| |
| evaluate_operation(s, [A, B, C], target=target, name="mmult", optimization="blocking", log=log) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| blocking: 0.321478 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 687-690 |
| |
| By reordering the computation to take advantage of caching, you should see a |
| significant improvement in the performance of the computation. Now, print the |
| internal representation and compare it to the original: |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 690-693 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for x_outer, y_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for x_inner_init, y_inner_init in T.grid(32, 32): |
| C_1[x_outer * 32768 + x_inner_init * 1024 + y_outer * 32 + y_inner_init] = T.float32(0) |
| for k_outer, k_inner, x_inner, y_inner in T.grid(256, 4, 32, 32): |
| cse_var_3: T.int32 = y_outer * 32 |
| cse_var_2: T.int32 = x_outer * 32768 + x_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 + y_inner |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1] = C_1[cse_var_1] + A_1[cse_var_2 + k_outer * 4 + k_inner] * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3 + y_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 694-705 |
| |
| Optimization 2: Vectorization |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Another important optimization trick is vectorization. When the memory access |
| pattern is uniform, the compiler can detect this pattern and pass the |
| contiguous memory to the SIMD vector processor. In TVM, we can use the |
| ``vectorize`` interface to hint the compiler about this pattern, taking |
| advantage of this hardware feature. |
| |
| In this tutorial, we choose to vectorize the inner loop row data since it is |
| already cache friendly from our previous optimizations. |
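In plain-Python terms (a sketch, not TVM output), vectorizing ``yi`` replaces the innermost scalar loop with one 16-wide operation over a contiguous row slice, which is what the compiler maps onto SIMD instructions:

```python
import numpy as np

# Same illustrative sizes as in the blocking sketch.
M = K = N = 64
bn, kfactor = 16, 4
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
c = np.zeros((M, N), dtype="float32")

for xo in range(M // bn):
    for yo in range(N // bn):
        ys = slice(yo * bn, (yo + 1) * bn)  # contiguous 16-wide row slice
        for ko in range(K // kfactor):
            for ki in range(kfactor):
                k = ko * kfactor + ki
                for xi in range(bn):
                    # one vector update stands in for the former yi loop
                    c[xo * bn + xi, ys] += a[xo * bn + xi, k] * b[k, ys]

np.testing.assert_allclose(c, a @ b, rtol=1e-4)
```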
| |
| .. GENERATED FROM PYTHON SOURCE LINES 705-714 |
| |
| .. code-block:: default |
| |
| |
| # Apply the vectorization optimization |
| s[C].vectorize(yi) |
| |
| evaluate_operation(s, [A, B, C], target=target, name="mmult", optimization="vectorization", log=log) |
| |
| # The generalized IR after vectorization |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| vectorization: 0.315512 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for x_outer, y_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for x_inner_init in range(32): |
| C_1[x_outer * 32768 + x_inner_init * 1024 + y_outer * 32:x_outer * 32768 + x_inner_init * 1024 + y_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, k_inner, x_inner in T.grid(256, 4, 32): |
| cse_var_3: T.int32 = y_outer * 32 |
| cse_var_2: T.int32 = x_outer * 32768 + x_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_2 + k_outer * 4 + k_inner], 32) * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3:k_outer * 4096 + k_inner * 1024 + cse_var_3 + 32] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 715-725 |
| |
| Optimization 3: Loop Permutation |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| If we look at the above IR, we can see that the inner loop row data is |
| vectorized and that B is transformed into PackedB (this is evident from the |
| 32-wide vector loads of ``B_1`` in the inner loop). The traversal of PackedB |
| is now sequential, so we will look at the access pattern of A. In the current |
| schedule, A is accessed column by column, which is not cache friendly. If we |
| swap the nested loop order of ``ki`` and the inner axis ``xi``, the access |
| pattern for the A matrix will be more cache friendly. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 725-742 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (k,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(k, factor=4) |
| |
| # re-ordering |
| s[C].reorder(xo, yo, ko, xi, ki, yi) |
| s[C].vectorize(yi) |
| |
| evaluate_operation( |
| s, [A, B, C], target=target, name="mmult", optimization="loop permutation", log=log |
| ) |
| |
| # Again, print the new generalized IR |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| loop permutation: 0.125845 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for x_outer, y_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for x_inner_init in range(32): |
| C_1[x_outer * 32768 + x_inner_init * 1024 + y_outer * 32:x_outer * 32768 + x_inner_init * 1024 + y_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, x_inner, k_inner in T.grid(256, 32, 4): |
| cse_var_3: T.int32 = y_outer * 32 |
| cse_var_2: T.int32 = x_outer * 32768 + x_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_2 + k_outer * 4 + k_inner], 32) * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3:k_outer * 4096 + k_inner * 1024 + cse_var_3 + 32] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 743-768 |
| |
| Optimization 4: Array Packing |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Another important trick is array packing, which reorders the storage layout |
| of an array so that a regular but discontinuous access pattern along a |
| certain dimension becomes sequential after flattening. |
| |
| .. image:: https://github.com/dmlc/web-data/raw/main/tvm/tutorial/array-packing.png |
| :align: center |
| |
| As shown in the figure above, after blocking the computations, we can |
| observe that the array access pattern of B (after flattening) is regular |
| but discontinuous. We expect that after some transformation we can get a |
| continuous access pattern. By reordering a ``[16][16]`` array into a |
| ``[16/4][16][4]`` array, the access pattern of B becomes sequential when |
| grabbing the corresponding values from the packed array. |
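The packing transformation itself is easy to see in plain NumPy (a sketch for illustration, using the ``[16] -> [16/4][16][4]`` sizes from the paragraph above):

```python
import numpy as np

# Small illustrative sizes matching the [16][16] -> [16/4][16][4] example.
K, N, bn = 16, 16, 4
B = np.arange(K * N, dtype="float32").reshape(K, N)

# packedB[x, k, z] = B[k, x * bn + z], the same mapping as the te.compute
# below: each bn-wide column group of B becomes one contiguous panel.
packedB = B.reshape(K, N // bn, bn).transpose(1, 0, 2).copy()

# Reading back: B[k, y] lives at packedB[y // bn, k, y % bn], so the
# vectorized inner loop walks packedB sequentially in memory.
for k in range(K):
    for y in range(N):
        assert packedB[y // bn, k, y % bn] == B[k, y]
```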
| |
| To accomplish this, we are going to have to start with a new default |
| schedule, taking into account the new packing of B. It's worth taking a |
| moment to comment on this: TE is a powerful and expressive language for |
| writing optimized operators, but it often requires some knowledge of the |
| underlying algorithm, data structures, and hardware target that you are |
| writing for. Later in the tutorial, we will discuss some of the options for |
| letting TVM take that burden. Regardless, let's move on with the new |
| optimized schedule. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 768-795 |
| |
| .. code-block:: default |
| |
| |
| # We have to re-write the algorithm slightly. |
|     packedB = te.compute((N // bn, K, bn), lambda x, y, z: B[y, x * bn + z], name="packedB") |
| C = te.compute( |
| (M, N), |
| lambda x, y: te.sum(A[x, k] * packedB[y // bn, k, tvm.tir.indexmod(y, bn)], axis=k), |
| name="C", |
| ) |
| |
| s = te.create_schedule(C.op) |
| |
| xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (k,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(k, factor=4) |
| |
| s[C].reorder(xo, yo, ko, xi, ki, yi) |
| s[C].vectorize(yi) |
| |
| x, y, z = s[packedB].op.axis |
| s[packedB].vectorize(z) |
| s[packedB].parallel(x) |
| |
| evaluate_operation(s, [A, B, C], target=target, name="mmult", optimization="array packing", log=log) |
| |
| # Here is the generated IR after array packing. |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| array packing: 0.108632 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for x in T.parallel(32): |
| for y in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[x * 1024 + y] = B_1[y * 1024 + x * 32:y * 1024 + x * 32 + 32] |
| for x_outer, y_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for x_inner_init in range(32): |
| C_1[x_outer * 32768 + x_inner_init * 1024 + y_outer * 32:x_outer * 32768 + x_inner_init * 1024 + y_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, x_inner, k_inner in T.grid(256, 32, 4): |
| cse_var_3: T.int32 = x_outer * 32768 + x_inner * 1024 |
| cse_var_2: T.int32 = k_outer * 4 |
| cse_var_1: T.int32 = cse_var_3 + y_outer * 32 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_3 + cse_var_2 + k_inner], 32) * packedB_1[y_outer * 1024 + cse_var_2 + k_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 796-806 |
| |
| Optimization 5: Optimizing Block Writing Through Caching |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| Up to this point, all of our optimizations have focused on efficiently |
| reading the data from the `A` and `B` matrices in order to compute the `C` |
| matrix. After the blocking optimization, the operator writes its results to |
| `C` block by block, and that access pattern is not sequential. We can |
| address this with a sequential cache array, using a combination of |
| `cache_write`, `compute_at`, and `unroll` to hold the block results and |
| write them to `C` once all the block results are ready. |
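In plain-Python terms (a sketch of the idea, not the TVM primitives themselves), the write cache amounts to accumulating each block of ``C`` in a small contiguous scratch buffer and copying it out once:

```python
import numpy as np

M = K = N = 64
bn = 16
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
c = np.zeros((M, N), dtype="float32")

# The effect of cache_write + compute_at: accumulate each bn x bn block
# of C in a small contiguous scratch buffer, then write the finished
# block back to C with one sequential copy.
for xo in range(M // bn):
    for yo in range(N // bn):
        cc = np.zeros((bn, bn), dtype="float32")  # the write cache
        for k in range(K):
            for xc in range(bn):
                cc[xc, :] += a[xo * bn + xc, k] * b[k, yo * bn:(yo + 1) * bn]
        c[xo * bn:(xo + 1) * bn, yo * bn:(yo + 1) * bn] = cc

np.testing.assert_allclose(c, a @ b, rtol=1e-4)
```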
| |
| .. GENERATED FROM PYTHON SOURCE LINES 806-835 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| |
| # Allocate write cache |
| CC = s.cache_write(C, "global") |
| |
| xo, yo, xi, yi = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| |
| # Write cache is computed at yo |
| s[CC].compute_at(s[C], yo) |
| |
| # New inner axes |
| xc, yc = s[CC].op.axis |
| |
| (k,) = s[CC].op.reduce_axis |
| ko, ki = s[CC].split(k, factor=4) |
| s[CC].reorder(ko, xc, ki, yc) |
| s[CC].unroll(ki) |
| s[CC].vectorize(yc) |
| |
| x, y, z = s[packedB].op.axis |
| s[packedB].vectorize(z) |
| s[packedB].parallel(x) |
| |
| evaluate_operation(s, [A, B, C], target=target, name="mmult", optimization="block caching", log=log) |
| |
| # Here is the generated IR after write cache blocking. |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| block caching: 0.112279 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| C_global = T.allocate([1024], "float32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for x in T.parallel(32): |
| for y in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[x * 1024 + y] = B_1[y * 1024 + x * 32:y * 1024 + x * 32 + 32] |
| for x_outer, y_outer in T.grid(32, 32): |
| C_global_1 = T.Buffer((1024,), data=C_global) |
| for x_c_init in range(32): |
| C_global_1[x_c_init * 32:x_c_init * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, x_c in T.grid(256, 32): |
| cse_var_4: T.int32 = k_outer * 4 |
| cse_var_3: T.int32 = x_c * 32 |
| cse_var_2: T.int32 = y_outer * 1024 + cse_var_4 |
| cse_var_1: T.int32 = x_outer * 32768 + x_c * 1024 + cse_var_4 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1], 32) * packedB_1[cse_var_2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 1], 32) * packedB_1[cse_var_2 + 1] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 2], 32) * packedB_1[cse_var_2 + 2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 3], 32) * packedB_1[cse_var_2 + 3] |
| for x_inner, y_inner in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[x_outer * 32768 + x_inner * 1024 + y_outer * 32 + y_inner] = C_global_1[x_inner * 32 + y_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 836-843 |
| |
| Optimization 6: Parallelization |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| So far, our computation is designed to use only a single core. Nearly all |
| modern processors have multiple cores, and our computation can benefit from |
| running in parallel. The final optimization is to take advantage of |
| thread-level parallelism. |
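Why parallelizing the outer ``xo`` loop is safe can be sketched in plain Python (an illustration with NumPy and a thread pool, not the TVM primitive): each worker owns a disjoint band of rows of ``C``, so no synchronization on the output is needed.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

M = K = N = 128
bn = 32
a = np.random.rand(M, K).astype("float32")
b = np.random.rand(K, N).astype("float32")
c = np.zeros((M, N), dtype="float32")

def compute_band(xo):
    # Each worker writes a disjoint bn-row band of C: an independent
    # sub-matmul, so the workers never touch the same output memory.
    rows = slice(xo * bn, (xo + 1) * bn)
    c[rows, :] = a[rows, :] @ b

with ThreadPoolExecutor() as pool:
    list(pool.map(compute_band, range(M // bn)))

np.testing.assert_allclose(c, a @ b, rtol=1e-4)
```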
| |
| .. GENERATED FROM PYTHON SOURCE LINES 843-858 |
| |
| .. code-block:: default |
| |
| |
| # parallel |
| s[C].parallel(xo) |
| |
| x, y, z = s[packedB].op.axis |
| s[packedB].vectorize(z) |
| s[packedB].parallel(x) |
| |
| evaluate_operation( |
| s, [A, B, C], target=target, name="mmult", optimization="parallelization", log=log |
| ) |
| |
| # Here is the generated IR after parallelization. |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| parallelization: 0.134426 |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for x in T.parallel(32): |
| for y in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[x * 1024 + y] = B_1[y * 1024 + x * 32:y * 1024 + x * 32 + 32] |
| for x_outer in T.parallel(32): |
| C_global = T.allocate([1024], "float32", "global") |
| for y_outer in range(32): |
| C_global_1 = T.Buffer((1024,), data=C_global) |
| for x_c_init in range(32): |
| C_global_1[x_c_init * 32:x_c_init * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, x_c in T.grid(256, 32): |
| cse_var_4: T.int32 = k_outer * 4 |
| cse_var_3: T.int32 = x_c * 32 |
| cse_var_2: T.int32 = y_outer * 1024 + cse_var_4 |
| cse_var_1: T.int32 = x_outer * 32768 + x_c * 1024 + cse_var_4 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1], 32) * packedB_1[cse_var_2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 1], 32) * packedB_1[cse_var_2 + 1] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 2], 32) * packedB_1[cse_var_2 + 2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 3], 32) * packedB_1[cse_var_2 + 3] |
| for x_inner, y_inner in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[x_outer * 32768 + x_inner * 1024 + y_outer * 32 + y_inner] = C_global_1[x_inner * 32 + y_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 859-866 |
| |
| Summary of Matrix Multiplication Example |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| |
| After applying the above simple optimizations with only 18 lines of code, our |
| generated code can begin to approach the performance of `numpy` with the Math |
| Kernel Library (MKL). Since we've been logging the performance as we've been |
| working, we can compare the results. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 866-875 |
| |
| .. code-block:: default |
| |
| |
| baseline = log[0][1] |
| print("%s\t%s\t%s" % ("Operator".rjust(20), "Timing".rjust(20), "Performance".rjust(20))) |
| for result in log: |
| print( |
| "%s\t%s\t%s" |
| % (result[0].rjust(20), str(result[1]).rjust(20), str(result[1] / baseline).rjust(20)) |
| ) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Operator Timing Performance |
| none 3.4411447787999996 1.0 |
| blocking 0.32147756800000005 0.09342169210099498 |
| vectorization 0.3155115755 0.09168796891191122 |
| loop permutation 0.12584523630000002 0.036570747349922585 |
| array packing 0.10863164939999999 0.03156846235277615 |
| block caching 0.1122794197 0.03262850792902536 |
| parallelization 0.1344261548 0.03906437056300702 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 876-882 |
| |
| Note that the outputs on the web page reflect the running times on a |
| non-exclusive Docker container, and should be considered unreliable. It is |
| highly encouraged to run the tutorial by yourself to observe the performance |
| gain achieved by TVM, and to carefully work through each example to |
| understand the iterative improvements that are made to the matrix |
| multiplication operation. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 884-906 |
| |
| Final Notes and Summary |
| ----------------------- |
| As mentioned earlier, applying optimizations with TE and scheduling |
| primitives can require some knowledge of the underlying architecture and |
| algorithms. However, TE was designed to act as a foundation for more complex |
| systems that can automatically search the space of potential optimizations. |
| With the knowledge you have from this introduction to TE, we can now begin to |
| explore how TVM can automate the schedule optimization process. |
| |
| This tutorial provided a walk-through of the TVM Tensor Expression (TE) |
| workflow using vector addition and matrix multiplication examples. The |
| general workflow is: |
| |
| - Describe your computation via a series of operations. |
| - Describe how the computation should be carried out using schedule primitives. |
| - Compile to the target function we want. |
| - Optionally, save the function so it can be loaded later. |
| |
| Upcoming tutorials expand on the matrix multiplication example, showing how |
| you can build generic templates of matrix multiplication and other operations |
| with tunable parameters, allowing you to automatically optimize the |
| computation for specific platforms. |
| |
| |
| .. rst-class:: sphx-glr-timing |
| |
| **Total running time of the script:** ( 1 minutes 1.678 seconds) |
| |
| |
| .. _sphx_glr_download_tutorial_tensor_expr_get_started.py: |
| |
| .. only:: html |
| |
| .. container:: sphx-glr-footer sphx-glr-footer-example |
| |
| |
| .. container:: sphx-glr-download sphx-glr-download-python |
| |
| :download:`Download Python source code: tensor_expr_get_started.py <tensor_expr_get_started.py>` |
| |
| .. container:: sphx-glr-download sphx-glr-download-jupyter |
| |
| :download:`Download Jupyter notebook: tensor_expr_get_started.ipynb <tensor_expr_get_started.ipynb>` |
| |
| |
| .. only:: html |
| |
| .. rst-class:: sphx-glr-signature |
| |
| `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_ |