docs/_sources/tutorial/autotvm_matmul_x86.rst.txt - tvm-site - Git at Google


 .. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
 .. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
 .. CHANGES, EDIT THE SOURCE PYTHON FILE:
 .. "tutorial/autotvm_matmul_x86.py"

 .. only:: html

     .. note::
         :class: sphx-glr-download-link-note

         This tutorial can be used interactively with Google Colab! You can also click
         :ref:`here <sphx_glr_download_tutorial_autotvm_matmul_x86.py>` to run the Jupyter notebook locally.

         .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg
             :align: center
             :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
             :width: 300px

 .. rst-class:: sphx-glr-example-title

 .. _sphx_glr_tutorial_autotvm_matmul_x86.py:


 .. _tutorial-autotvm-matmul-x86:

 Optimizing Operators with Schedule Templates and AutoTVM
 ========================================================
 **Authors**:
 `Lianmin Zheng <https://github.com/merrymercy>`_,
 `Chris Hoge <https://github.com/hogepodge>`_

 In this tutorial, we show how the TVM Tensor Expression (TE) language
 can be used to write schedule templates that can be searched by AutoTVM to
 find the optimal schedule. This process is called Auto-Tuning, which helps
 automate the process of optimizing tensor computation.

 This tutorial builds on the previous :doc:`tutorial on how to write a matrix
 multiplication using TE <tensor_expr_get_started>`.

 There are two steps in auto-tuning.

 - The first step is defining a search space.
 - The second step is running a search algorithm to explore through this space.

 In this tutorial, you can learn how to perform these two steps in TVM. The whole
 workflow is illustrated by a matrix multiplication example.

 .. note::
   Note that this tutorial will not run on Windows or recent versions of macOS.
   To get it to run, you will need to wrap the body of this tutorial in a
   :code:`if __name__ == "__main__":` block.

 .. GENERATED FROM PYTHON SOURCE LINES 49-66

 Install dependencies
 --------------------
 To use autotvm package in TVM, we need to install some extra dependencies.

 .. code-block:: bash

   pip3 install --user psutil xgboost cloudpickle

 To make TVM run faster in tuning, it is recommended to use cython as FFI of
 TVM. In the root directory of TVM, execute:

 .. code-block:: bash

   pip3 install --user cython
   sudo make cython3

 Now return to python code. Begin by importing the required packages.

 .. GENERATED FROM PYTHON SOURCE LINES 66-79

 .. code-block:: default


     import logging
     import sys

     import numpy as np
     import tvm
     from tvm import te
     import tvm.testing

     # the module is called `autotvm`
     from tvm import autotvm


 .. GENERATED FROM PYTHON SOURCE LINES 80-87

 Basic Matrix Multiplication with TE
 -----------------------------------
 Recall the basic implementation of matrix multiplication using TE. We write
 it down here with a few changes. We will wrap the multiplication in a python
 function definition. For simplicity, we will focus our attention on a split
 optimization, using a fixed value that defines the block size of the
 reordering.

 .. GENERATED FROM PYTHON SOURCE LINES 87-110

 .. code-block:: default


     def matmul_basic(N, L, M, dtype):

         A = te.placeholder((N, L), name="A", dtype=dtype)
         B = te.placeholder((L, M), name="B", dtype=dtype)

         k = te.reduce_axis((0, L), name="k")
         C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
         s = te.create_schedule(C.op)

         # schedule
         y, x = s[C].op.axis
         k = s[C].op.reduce_axis[0]

         yo, yi = s[C].split(y, 8)
         xo, xi = s[C].split(x, 8)

         s[C].reorder(yo, xo, k, yi, xi)

         return s, [A, B, C]


 .. GENERATED FROM PYTHON SOURCE LINES 111-124

 Matrix Multiplication with AutoTVM
 ----------------------------------
 In the previous schedule code, we use a constant "8" as the tiling factor.
 However, it might not be the best one because the best tiling factor depends
 on real hardware environment and input shape.

 If you want the schedule code to be portable across a wider range of input
 shapes and target hardware, it is better to define a set of candidate values
 and pick the best one according to the measurement results on target
 hardware.

 In autotvm, we can define a tunable parameter, or a "knob" for such kind of
 value.

 .. GENERATED FROM PYTHON SOURCE LINES 126-130

 A Basic Matrix Multiplication Template
 --------------------------------------
 We begin with an example of how to create a tunable parameter set for the
 block size of the `split` scheduling operation.

 .. GENERATED FROM PYTHON SOURCE LINES 130-161

 .. code-block:: default


     # Matmul V1: List candidate values
     @autotvm.template("tutorial/matmul_v1")  # 1. use a decorator
     def matmul_v1(N, L, M, dtype):
         A = te.placeholder((N, L), name="A", dtype=dtype)
         B = te.placeholder((L, M), name="B", dtype=dtype)

         k = te.reduce_axis((0, L), name="k")
         C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
         s = te.create_schedule(C.op)

         # schedule
         y, x = s[C].op.axis
         k = s[C].op.reduce_axis[0]

         # 2. get the config object
         cfg = autotvm.get_config()

         # 3. define search space
         cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
         cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

         # 4. schedule according to config
         yo, yi = s[C].split(y, cfg["tile_y"].val)
         xo, xi = s[C].split(x, cfg["tile_x"].val)

         s[C].reorder(yo, xo, k, yi, xi)

         return s, [A, B, C]


 .. GENERATED FROM PYTHON SOURCE LINES 162-197

 Here we make four modifications to the previous schedule code and get a
 tunable "template". We can explain the modifications one by one.

 1. Use a decorator to mark this function as a simple template.
 2. Get a config object: You can regard this :code:`cfg` as an argument of
    this function but we obtain it in a different way. With this argument, this
    function is no longer a deterministic schedule. Instead, we can pass
    different configurations to this function and get different schedules. A
    function that uses a configuration object like this is called a "template".

    To make the template function more compact, we can do two things to define
    the parameter search space within a single function.

    1. Define a search space across a set values. This is done by making
       :code:`cfg` a :any:`ConfigSpace` object. It will collect all of the
       tunable knobs in this function and build a search space from it.
    2. Schedule according to an entity in this space. This is done by making
       :code:`cfg` a :any:`ConfigEntity` object. When it is a
       :any:`ConfigEntity`, it will ignore all space definition API (namely,
       :code:`cfg.define_XXXXX(...)`). Instead, it will store deterministic
       values for all tunable knobs, and we schedule according to these values.

    During auto-tuning, we will first call this template with a
    :any:`ConfigSpace` object to build the search space. Then we call this
    template with different :any:`ConfigEntity` in the built space to get
    different schedules. Finally we will measure the code generated by
    different schedules and pick the best one.

 3. Define two tunable knobs. The first one is :code:`tile_y` with 5 possible
    values. The second one is :code:`tile_x` with a same list of possible values.
    These two knobs are independent, so they span a search space with size 25 =
    5x5.
 4. The configuration knobs are passed to the :code:`split` schedule
    operation, allowing us to schedule according to the 5x5 deterministic values
    we previously defined in :code:`cfg`.

 .. GENERATED FROM PYTHON SOURCE LINES 199-215

 A Matrix Multiplication Template with the Advanced Parameter API
 ----------------------------------------------------------------
 In the previous template, we manually listed all of the possible values for a
 knob. This is the lowest level API to define the space, and gives an explicit
 enumeration of the parameter space to search. However, we also provide
 another set of APIs that can make the definition of the search space easier
 and smarter. Where possible, we recommend you use this higher-level API

 In the following example, we use :any:`ConfigSpace.define_split` to define a
 split knob. It will enumerate all the possible ways to split an axis and
 construct the space.

 We also have :any:`ConfigSpace.define_reorder` for reorder knob and
 :any:`ConfigSpace.define_annotate` for annotation like unroll, vectorization,
 thread binding. When the high level API cannot meet your requirements, you
 can always fall back to using the low level API.

 .. GENERATED FROM PYTHON SOURCE LINES 215-245

 .. code-block:: default


     @autotvm.template("tutorial/matmul")
     def matmul(N, L, M, dtype):
         A = te.placeholder((N, L), name="A", dtype=dtype)
         B = te.placeholder((L, M), name="B", dtype=dtype)

         k = te.reduce_axis((0, L), name="k")
         C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
         s = te.create_schedule(C.op)

         # schedule
         y, x = s[C].op.axis
         k = s[C].op.reduce_axis[0]

         ##### define space begin #####
         cfg = autotvm.get_config()
         cfg.define_split("tile_y", y, num_outputs=2)
         cfg.define_split("tile_x", x, num_outputs=2)
         ##### define space end #####

         # schedule according to config
         yo, yi = cfg["tile_y"].apply(s, C, y)
         xo, xi = cfg["tile_x"].apply(s, C, x)

         s[C].reorder(yo, xo, k, yi, xi)

         return s, [A, B, C]


 .. GENERATED FROM PYTHON SOURCE LINES 246-266

 .. admonition:: More Explanation on :code:`cfg.define_split`

  In this template, :code:`cfg.define_split("tile_y", y, num_outputs=2)` will
  enumerate all possible combinations that can split axis y into two axes with
  factors of the length of y. For example, if the length of y is 32 and we
  want to split it into two axes using factors of 32, then there are 6
  possible values for (length of outer axis, length of inner axis) pair,
  namely (32, 1), (16, 2), (8, 4), (4, 8), (2, 16) or (1, 32). These are all 6
  possible values of `tile_y`.

  During scheduling, :code:`cfg["tile_y"]` is a :code:`SplitEntity` object.
  We stores the lengths of outer axes and inner axes in
  :code:`cfg['tile_y'].size` (a tuple with two elements).  In this template,
  we apply it by using :code:`yo, yi = cfg['tile_y'].apply(s, C, y)`.
  Actually, this is equivalent to :code:`yo, yi = s[C].split(y,
  cfg["tile_y"].size[1])` or  :code:`yo, yi = s[C].split(y,
  nparts=cfg['tile_y"].size[0])`

  The advantage of using cfg.apply API is that it makes multi-level splits
  (that is, when num_outputs >= 3) easier.

 .. GENERATED FROM PYTHON SOURCE LINES 268-302

 Step 2: Use AutoTVM to Optimize the Matrix Multiplication
 ---------------------------------------------------------
 In Step 1, we wrote a matrix multiplication template that allowed us to
 parameterize the block size used in the `split` schedule. We can now conduct
 a search over this parameter space. The next step is to pick a tuner to guide
 the exploration of this space.

 Auto-tuners in TVM
 ~~~~~~~~~~~~~~~~~~
 The job for a tuner can be described by following pseudo code

   .. code-block:: c

    ct = 0
    while ct < max_number_of_trials:
        propose a batch of configs
        measure this batch of configs on real hardware and get results
        ct += batch_size

 When proposing the next batch of configs, the tuner can take different
 strategies. Some of the tuner strategies provided by TVM include:

 * :any:`tvm.autotvm.tuner.RandomTuner`: Enumerate the space in a random order
 * :any:`tvm.autotvm.tuner.GridSearchTuner`: Enumerate the space in a grid search order
 * :any:`tvm.autotvm.tuner.GATuner`: Using genetic algorithm to search through the space
 * :any:`tvm.autotvm.tuner.XGBTuner`: Uses a model based method. Train a XGBoost model to
   predict the speed of lowered IR and pick the next batch according to the
   prediction.

 You can choose the tuner according to the size of your space, your time
 budget and other factors.  For example, if your space is very small (less
 than 1000), a grid-search tuner or a random tuner is good enough. If your
 space is at the level of 10^9 (this is the space size of a conv2d operator on
 CUDA GPU), XGBoostTuner can explore more efficiently and find better configs.

 .. GENERATED FROM PYTHON SOURCE LINES 304-310

 Begin tuning
 ~~~~~~~~~~~~
 Here we continue our matrix multiplication example. First we create a tuning
 task. We can also inspect the initialized search space. In this case, for a
 512x512 square matrix multiplication, the space size is 10x10=100 Note that
 the task and search space are independent of the tuner picked.

 .. GENERATED FROM PYTHON SOURCE LINES 310-315

 .. code-block:: default


     N, L, M = 512, 512, 512
     task = autotvm.task.create("tutorial/matmul", args=(N, L, M, "float32"), target="llvm")
     print(task.config_space)


 .. rst-class:: sphx-glr-script-out

  .. code-block:: none

     ConfigSpace (len=100, range_length=100, space_map=
        0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
        1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
     )


 .. GENERATED FROM PYTHON SOURCE LINES 316-323

 Then we need to define how to measure the generated code and pick a tuner.
 Since our space is small, a random tuner is just okay.

 We only make 10 trials in this tutorial for demonstration. In practice, you
 can do more trials according to your time budget. We will log the tuning
 results into a log file. This file can be used to choose the best
 configuration discovered by the tuner later.

 .. GENERATED FROM PYTHON SOURCE LINES 323-328

 .. code-block:: default


     # logging config (for printing tuning log to the screen)
     logging.getLogger("autotvm").setLevel(logging.DEBUG)
     logging.getLogger("autotvm").addHandler(logging.StreamHandler(sys.stdout))


 .. GENERATED FROM PYTHON SOURCE LINES 329-332

 There are two steps for measuring a config: build and run. By default, we use
 all CPU cores to compile program. We then measure them sequentially. To help
 reduce variance, we take 5 measurements and average them.

 .. GENERATED FROM PYTHON SOURCE LINES 332-343

 .. code-block:: default

     measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

     # Begin tuning with RandomTuner, log records to file `matmul.log`
     # You can use alternatives like XGBTuner.
     tuner = autotvm.tuner.RandomTuner(task)
     tuner.tune(
         n_trial=10,
         measure_option=measure_option,
         callbacks=[autotvm.callback.log_to_file("matmul.log")],
     )


 .. rst-class:: sphx-glr-script-out

  .. code-block:: none

     waiting for device...
     device available
     Get devices for measurement successfully!
     No: 1   GFLOPS: 9.95/9.95       result: MeasureResult(costs=(0.026991350199999996,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7835841178894043, timestamp=1714772753.7806447)       [('tile_y', [-1, 512]), ('tile_x', [-1, 128])],None,79
     No: 2   GFLOPS: 8.75/9.95       result: MeasureResult(costs=(0.0306641994,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7925927639007568, timestamp=1714772754.5419)  [('tile_y', [-1, 128]), ('tile_x', [-1, 16])],None,47
     No: 3   GFLOPS: 10.66/10.66     result: MeasureResult(costs=(0.025178763799999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7073245048522949, timestamp=1714772755.2139173)       [('tile_y', [-1, 256]), ('tile_x', [-1, 256])],None,88
     No: 4   GFLOPS: 11.90/11.90     result: MeasureResult(costs=(0.0225620518,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6628890037536621, timestamp=1714772755.8438463)       [('tile_y', [-1, 8]), ('tile_x', [-1, 256])],None,83
     No: 5   GFLOPS: 3.60/11.90      result: MeasureResult(costs=(0.0746499722,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4612109661102295, timestamp=1714772757.4345112)       [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
     No: 6   GFLOPS: 3.62/11.90      result: MeasureResult(costs=(0.0740701102,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.471846342086792, timestamp=1714772758.8837745)        [('tile_y', [-1, 128]), ('tile_x', [-1, 8])],None,37
     No: 7   GFLOPS: 10.27/11.90     result: MeasureResult(costs=(0.026150489800000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7033135890960693, timestamp=1714772759.5768602)       [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98
     No: 8   GFLOPS: 10.28/11.90     result: MeasureResult(costs=(0.026110290199999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6939852237701416, timestamp=1714772760.2588608)       [('tile_y', [-1, 2]), ('tile_x', [-1, 32])],None,51
     No: 9   GFLOPS: 0.46/11.90      result: MeasureResult(costs=(0.584163126,), error_no=MeasureErrorNo.NO_ERROR, all_cost=9.571329832077026, timestamp=1714772769.956309)  [('tile_y', [-1, 512]), ('tile_x', [-1, 1])],None,9
     No: 10  GFLOPS: 0.51/11.90      result: MeasureResult(costs=(0.5271286306,), error_no=MeasureErrorNo.NO_ERROR, all_cost=8.672711849212646, timestamp=1714772778.652354) [('tile_y', [-1, 128]), ('tile_x', [-1, 1])],None,7


 .. GENERATED FROM PYTHON SOURCE LINES 344-351

 With tuning completed, we can choose the configuration from the log file that
 has the best measured performance and compile the schedule with the
 corresponding parameters. We also do a quick verification that the schedule is
 producing correct answers.  We can call the function :code:`matmul` directly
 under the :any:`autotvm.apply_history_best` context. When we call this
 function, it will query the dispatch context with its argument and get the
 best config with the same argument.

 .. GENERATED FROM PYTHON SOURCE LINES 351-368

 .. code-block:: default


     # apply history best from log file
     with autotvm.apply_history_best("matmul.log"):
         with tvm.target.Target("llvm"):
             s, arg_bufs = matmul(N, L, M, "float32")
             func = tvm.build(s, arg_bufs)

     # check correctness
     a_np = np.random.uniform(size=(N, L)).astype(np.float32)
     b_np = np.random.uniform(size=(L, M)).astype(np.float32)
     c_np = a_np.dot(b_np)

     c_tvm = tvm.nd.empty(c_np.shape)
     func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

     tvm.testing.assert_allclose(c_np, c_tvm.numpy(), rtol=1e-4)


 .. rst-class:: sphx-glr-script-out

  .. code-block:: none

     Finish loading 10 records


 .. GENERATED FROM PYTHON SOURCE LINES 369-379

 Final Notes and Summary
 -----------------------
 In this tutorial, we have shown how to build operator templates that allow
 TVM to search a parameter space and choose optimized schedule configurations.
 To gain a deeper understanding of how this works, we recommend expanding on
 this example by adding new search parameters to the schedule based on
 schedule operations demonstrated in the :ref: `Getting Started With Tensor
 Expressions <tensor_expr_get_started>_` tutorial. In the upcoming sections, we
 will demonstrate the AutoScheduler, a method for TVM to optimize common
 operators without the need for the user to provide a user-defined template.


 .. _sphx_glr_download_tutorial_autotvm_matmul_x86.py:

 .. only:: html

   .. container:: sphx-glr-footer sphx-glr-footer-example


     .. container:: sphx-glr-download sphx-glr-download-python

       :download:`Download Python source code: autotvm_matmul_x86.py <autotvm_matmul_x86.py>`

     .. container:: sphx-glr-download sphx-glr-download-jupyter

       :download:`Download Jupyter notebook: autotvm_matmul_x86.ipynb <autotvm_matmul_x86.ipynb>`


 .. only:: html

  .. rst-class:: sphx-glr-signature

     `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_

	.. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY
	.. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE
	.. CHANGES, EDIT THE SOURCE PYTHON FILE:
	.. "tutorial/autotvm_matmul_x86.py"

	.. only:: html

	.. note::
	:class: sphx-glr-download-link-note

	This tutorial can be used interactively with Google Colab! You can also click
	:ref:`here <sphx_glr_download_tutorial_autotvm_matmul_x86.py>` to run the Jupyter notebook locally.

	.. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg
	:align: center
	:target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/37bbf9e2065ec8deeb64a8d9fa0755bc/autotvm_matmul_x86.ipynb
	:width: 300px

	.. rst-class:: sphx-glr-example-title

	.. _sphx_glr_tutorial_autotvm_matmul_x86.py:


	.. _tutorial-autotvm-matmul-x86:

	Optimizing Operators with Schedule Templates and AutoTVM
	========================================================
	Authors:
	`Lianmin Zheng <https://github.com/merrymercy>`_,
	`Chris Hoge <https://github.com/hogepodge>`_

	In this tutorial, we show how the TVM Tensor Expression (TE) language
	can be used to write schedule templates that can be searched by AutoTVM to
	find the optimal schedule. This process is called Auto-Tuning, which helps
	automate the process of optimizing tensor computation.

	This tutorial builds on the previous :doc:`tutorial on how to write a matrix
	multiplication using TE <tensor_expr_get_started>`.

	There are two steps in auto-tuning.

	- The first step is defining a search space.
	- The second step is running a search algorithm to explore through this space.

	In this tutorial, you can learn how to perform these two steps in TVM. The whole
	workflow is illustrated by a matrix multiplication example.

	.. note::
	Note that this tutorial will not run on Windows or recent versions of macOS.
	To get it to run, you will need to wrap the body of this tutorial in a
	:code:`if __name__ == "__main__":` block.

	.. GENERATED FROM PYTHON SOURCE LINES 49-66

	Install dependencies
	--------------------
	To use autotvm package in TVM, we need to install some extra dependencies.

	.. code-block:: bash

	pip3 install --user psutil xgboost cloudpickle

	To make TVM run faster in tuning, it is recommended to use cython as FFI of
	TVM. In the root directory of TVM, execute:

	.. code-block:: bash

	pip3 install --user cython
	sudo make cython3

	Now return to python code. Begin by importing the required packages.

	.. GENERATED FROM PYTHON SOURCE LINES 66-79

	.. code-block:: default



	import logging
	import sys

	import numpy as np
	import tvm
	from tvm import te
	import tvm.testing

	# the module is called `autotvm`
	from tvm import autotvm








	.. GENERATED FROM PYTHON SOURCE LINES 80-87

	Basic Matrix Multiplication with TE
	-----------------------------------
	Recall the basic implementation of matrix multiplication using TE. We write
	it down here with a few changes. We will wrap the multiplication in a python
	function definition. For simplicity, we will focus our attention on a split
	optimization, using a fixed value that defines the block size of the
	reordering.

	.. GENERATED FROM PYTHON SOURCE LINES 87-110

	.. code-block:: default



	def matmul_basic(N, L, M, dtype):

	A = te.placeholder((N, L), name="A", dtype=dtype)
	B = te.placeholder((L, M), name="B", dtype=dtype)

	k = te.reduce_axis((0, L), name="k")
	C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
	s = te.create_schedule(C.op)

	# schedule
	y, x = s[C].op.axis
	k = s[C].op.reduce_axis[0]

	yo, yi = s[C].split(y, 8)
	xo, xi = s[C].split(x, 8)

	s[C].reorder(yo, xo, k, yi, xi)

	return s, [A, B, C]









	.. GENERATED FROM PYTHON SOURCE LINES 111-124

	Matrix Multiplication with AutoTVM
	----------------------------------
	In the previous schedule code, we use a constant "8" as the tiling factor.
	However, it might not be the best one because the best tiling factor depends
	on real hardware environment and input shape.

	If you want the schedule code to be portable across a wider range of input
	shapes and target hardware, it is better to define a set of candidate values
	and pick the best one according to the measurement results on target
	hardware.

	In autotvm, we can define a tunable parameter, or a "knob" for such kind of
	value.

	.. GENERATED FROM PYTHON SOURCE LINES 126-130

	A Basic Matrix Multiplication Template
	--------------------------------------
	We begin with an example of how to create a tunable parameter set for the
	block size of the `split` scheduling operation.

	.. GENERATED FROM PYTHON SOURCE LINES 130-161

	.. code-block:: default


	# Matmul V1: List candidate values
	@autotvm.template("tutorial/matmul_v1") # 1. use a decorator
	def matmul_v1(N, L, M, dtype):
	A = te.placeholder((N, L), name="A", dtype=dtype)
	B = te.placeholder((L, M), name="B", dtype=dtype)

	k = te.reduce_axis((0, L), name="k")
	C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
	s = te.create_schedule(C.op)

	# schedule
	y, x = s[C].op.axis
	k = s[C].op.reduce_axis[0]

	# 2. get the config object
	cfg = autotvm.get_config()

	# 3. define search space
	cfg.define_knob("tile_y", [1, 2, 4, 8, 16])
	cfg.define_knob("tile_x", [1, 2, 4, 8, 16])

	# 4. schedule according to config
	yo, yi = s[C].split(y, cfg["tile_y"].val)
	xo, xi = s[C].split(x, cfg["tile_x"].val)

	s[C].reorder(yo, xo, k, yi, xi)

	return s, [A, B, C]









	.. GENERATED FROM PYTHON SOURCE LINES 162-197

	Here we make four modifications to the previous schedule code and get a
	tunable "template". We can explain the modifications one by one.

	1. Use a decorator to mark this function as a simple template.
	2. Get a config object: You can regard this :code:`cfg` as an argument of
	this function but we obtain it in a different way. With this argument, this
	function is no longer a deterministic schedule. Instead, we can pass
	different configurations to this function and get different schedules. A
	function that uses a configuration object like this is called a "template".

	To make the template function more compact, we can do two things to define
	the parameter search space within a single function.

	1. Define a search space across a set values. This is done by making
	:code:`cfg` a :any:`ConfigSpace` object. It will collect all of the
	tunable knobs in this function and build a search space from it.
	2. Schedule according to an entity in this space. This is done by making
	:code:`cfg` a :any:`ConfigEntity` object. When it is a
	:any:`ConfigEntity`, it will ignore all space definition API (namely,
	:code:`cfg.define_XXXXX(...)`). Instead, it will store deterministic
	values for all tunable knobs, and we schedule according to these values.

	During auto-tuning, we will first call this template with a
	:any:`ConfigSpace` object to build the search space. Then we call this
	template with different :any:`ConfigEntity` in the built space to get
	different schedules. Finally we will measure the code generated by
	different schedules and pick the best one.

	3. Define two tunable knobs. The first one is :code:`tile_y` with 5 possible
	values. The second one is :code:`tile_x` with a same list of possible values.
	These two knobs are independent, so they span a search space with size 25 =
	5x5.
	4. The configuration knobs are passed to the :code:`split` schedule
	operation, allowing us to schedule according to the 5x5 deterministic values
	we previously defined in :code:`cfg`.

	.. GENERATED FROM PYTHON SOURCE LINES 199-215

	A Matrix Multiplication Template with the Advanced Parameter API
	----------------------------------------------------------------
	In the previous template, we manually listed all of the possible values for a
	knob. This is the lowest level API to define the space, and gives an explicit
	enumeration of the parameter space to search. However, we also provide
	another set of APIs that can make the definition of the search space easier
	and smarter. Where possible, we recommend you use this higher-level API

	In the following example, we use :any:`ConfigSpace.define_split` to define a
	split knob. It will enumerate all the possible ways to split an axis and
	construct the space.

	We also have :any:`ConfigSpace.define_reorder` for reorder knob and
	:any:`ConfigSpace.define_annotate` for annotation like unroll, vectorization,
	thread binding. When the high level API cannot meet your requirements, you
	can always fall back to using the low level API.

	.. GENERATED FROM PYTHON SOURCE LINES 215-245

	.. code-block:: default



	@autotvm.template("tutorial/matmul")
	def matmul(N, L, M, dtype):
	A = te.placeholder((N, L), name="A", dtype=dtype)
	B = te.placeholder((L, M), name="B", dtype=dtype)

	k = te.reduce_axis((0, L), name="k")
	C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
	s = te.create_schedule(C.op)

	# schedule
	y, x = s[C].op.axis
	k = s[C].op.reduce_axis[0]

	##### define space begin #####
	cfg = autotvm.get_config()
	cfg.define_split("tile_y", y, num_outputs=2)
	cfg.define_split("tile_x", x, num_outputs=2)
	##### define space end #####

	# schedule according to config
	yo, yi = cfg["tile_y"].apply(s, C, y)
	xo, xi = cfg["tile_x"].apply(s, C, x)

	s[C].reorder(yo, xo, k, yi, xi)

	return s, [A, B, C]









	.. GENERATED FROM PYTHON SOURCE LINES 246-266

	.. admonition:: More Explanation on :code:`cfg.define_split`

	In this template, :code:`cfg.define_split("tile_y", y, num_outputs=2)` will
	enumerate all possible combinations that can split axis y into two axes with
	factors of the length of y. For example, if the length of y is 32 and we
	want to split it into two axes using factors of 32, then there are 6
	possible values for (length of outer axis, length of inner axis) pair,
	namely (32, 1), (16, 2), (8, 4), (4, 8), (2, 16) or (1, 32). These are all 6
	possible values of `tile_y`.

	During scheduling, :code:`cfg["tile_y"]` is a :code:`SplitEntity` object.
	We stores the lengths of outer axes and inner axes in
	:code:`cfg['tile_y'].size` (a tuple with two elements). In this template,
	we apply it by using :code:`yo, yi = cfg['tile_y'].apply(s, C, y)`.
	Actually, this is equivalent to :code:`yo, yi = s[C].split(y,
	cfg["tile_y"].size[1])` or :code:`yo, yi = s[C].split(y,
	nparts=cfg['tile_y"].size[0])`

	The advantage of using cfg.apply API is that it makes multi-level splits
	(that is, when num_outputs >= 3) easier.

	.. GENERATED FROM PYTHON SOURCE LINES 268-302

	Step 2: Use AutoTVM to Optimize the Matrix Multiplication
	---------------------------------------------------------
	In Step 1, we wrote a matrix multiplication template that allowed us to
	parameterize the block size used in the `split` schedule. We can now conduct
	a search over this parameter space. The next step is to pick a tuner to guide
	the exploration of this space.

	Auto-tuners in TVM
	~~~~~~~~~~~~~~~~~~
	The job for a tuner can be described by following pseudo code

	.. code-block:: c

	ct = 0
	while ct < max_number_of_trials:
	propose a batch of configs
	measure this batch of configs on real hardware and get results
	ct += batch_size

	When proposing the next batch of configs, the tuner can take different
	strategies. Some of the tuner strategies provided by TVM include:

	* :any:`tvm.autotvm.tuner.RandomTuner`: Enumerate the space in a random order
	* :any:`tvm.autotvm.tuner.GridSearchTuner`: Enumerate the space in a grid search order
	* :any:`tvm.autotvm.tuner.GATuner`: Using genetic algorithm to search through the space
	* :any:`tvm.autotvm.tuner.XGBTuner`: Uses a model based method. Train a XGBoost model to
	predict the speed of lowered IR and pick the next batch according to the
	prediction.

	You can choose the tuner according to the size of your space, your time
	budget and other factors. For example, if your space is very small (less
	than 1000), a grid-search tuner or a random tuner is good enough. If your
	space is at the level of 10^9 (this is the space size of a conv2d operator on
	CUDA GPU), XGBoostTuner can explore more efficiently and find better configs.

	.. GENERATED FROM PYTHON SOURCE LINES 304-310

	Begin tuning
	~~~~~~~~~~~~
	Here we continue our matrix multiplication example. First we create a tuning
	task. We can also inspect the initialized search space. In this case, for a
	512x512 square matrix multiplication, the space size is 10x10=100 Note that
	the task and search space are independent of the tuner picked.

	.. GENERATED FROM PYTHON SOURCE LINES 310-315

	.. code-block:: default


	N, L, M = 512, 512, 512
	task = autotvm.task.create("tutorial/matmul", args=(N, L, M, "float32"), target="llvm")
	print(task.config_space)





	.. rst-class:: sphx-glr-script-out

	.. code-block:: none

	ConfigSpace (len=100, range_length=100, space_map=
	0 tile_y: Split(policy=factors, product=512, num_outputs=2) len=10
	1 tile_x: Split(policy=factors, product=512, num_outputs=2) len=10
	)




	.. GENERATED FROM PYTHON SOURCE LINES 316-323

	Then we need to define how to measure the generated code and pick a tuner.
	Since our space is small, a random tuner is just okay.

	We only make 10 trials in this tutorial for demonstration. In practice, you
	can do more trials according to your time budget. We will log the tuning
	results into a log file. This file can be used to choose the best
	configuration discovered by the tuner later.

	.. GENERATED FROM PYTHON SOURCE LINES 323-328

	.. code-block:: default


	# logging config (for printing tuning log to the screen)
	logging.getLogger("autotvm").setLevel(logging.DEBUG)
	logging.getLogger("autotvm").addHandler(logging.StreamHandler(sys.stdout))








	.. GENERATED FROM PYTHON SOURCE LINES 329-332

	There are two steps for measuring a config: build and run. By default, we use
	all CPU cores to compile program. We then measure them sequentially. To help
	reduce variance, we take 5 measurements and average them.

	.. GENERATED FROM PYTHON SOURCE LINES 332-343

	.. code-block:: default

	measure_option = autotvm.measure_option(builder="local", runner=autotvm.LocalRunner(number=5))

	# Begin tuning with RandomTuner, log records to file `matmul.log`
	# You can use alternatives like XGBTuner.
	tuner = autotvm.tuner.RandomTuner(task)
	tuner.tune(
	n_trial=10,
	measure_option=measure_option,
	callbacks=[autotvm.callback.log_to_file("matmul.log")],
	)





	.. rst-class:: sphx-glr-script-out

	.. code-block:: none

	waiting for device...
	device available
	Get devices for measurement successfully!
	No: 1 GFLOPS: 9.95/9.95 result: MeasureResult(costs=(0.026991350199999996,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7835841178894043, timestamp=1714772753.7806447) [('tile_y', [-1, 512]), ('tile_x', [-1, 128])],None,79
	No: 2 GFLOPS: 8.75/9.95 result: MeasureResult(costs=(0.0306641994,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7925927639007568, timestamp=1714772754.5419) [('tile_y', [-1, 128]), ('tile_x', [-1, 16])],None,47
	No: 3 GFLOPS: 10.66/10.66 result: MeasureResult(costs=(0.025178763799999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7073245048522949, timestamp=1714772755.2139173) [('tile_y', [-1, 256]), ('tile_x', [-1, 256])],None,88
	No: 4 GFLOPS: 11.90/11.90 result: MeasureResult(costs=(0.0225620518,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6628890037536621, timestamp=1714772755.8438463) [('tile_y', [-1, 8]), ('tile_x', [-1, 256])],None,83
	No: 5 GFLOPS: 3.60/11.90 result: MeasureResult(costs=(0.0746499722,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.4612109661102295, timestamp=1714772757.4345112) [('tile_y', [-1, 2]), ('tile_x', [-1, 2])],None,11
	No: 6 GFLOPS: 3.62/11.90 result: MeasureResult(costs=(0.0740701102,), error_no=MeasureErrorNo.NO_ERROR, all_cost=1.471846342086792, timestamp=1714772758.8837745) [('tile_y', [-1, 128]), ('tile_x', [-1, 8])],None,37
	No: 7 GFLOPS: 10.27/11.90 result: MeasureResult(costs=(0.026150489800000003,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.7033135890960693, timestamp=1714772759.5768602) [('tile_y', [-1, 256]), ('tile_x', [-1, 512])],None,98
	No: 8 GFLOPS: 10.28/11.90 result: MeasureResult(costs=(0.026110290199999997,), error_no=MeasureErrorNo.NO_ERROR, all_cost=0.6939852237701416, timestamp=1714772760.2588608) [('tile_y', [-1, 2]), ('tile_x', [-1, 32])],None,51
	No: 9 GFLOPS: 0.46/11.90 result: MeasureResult(costs=(0.584163126,), error_no=MeasureErrorNo.NO_ERROR, all_cost=9.571329832077026, timestamp=1714772769.956309) [('tile_y', [-1, 512]), ('tile_x', [-1, 1])],None,9
	No: 10 GFLOPS: 0.51/11.90 result: MeasureResult(costs=(0.5271286306,), error_no=MeasureErrorNo.NO_ERROR, all_cost=8.672711849212646, timestamp=1714772778.652354) [('tile_y', [-1, 128]), ('tile_x', [-1, 1])],None,7




	.. GENERATED FROM PYTHON SOURCE LINES 344-351

	With tuning completed, we can choose the configuration from the log file that
	has the best measured performance and compile the schedule with the
	corresponding parameters. We also do a quick verification that the schedule is
	producing correct answers. We can call the function :code:`matmul` directly
	under the :any:`autotvm.apply_history_best` context. When we call this
	function, it will query the dispatch context with its argument and get the
	best config with the same argument.

	.. GENERATED FROM PYTHON SOURCE LINES 351-368

	.. code-block:: default


	# apply history best from log file
	with autotvm.apply_history_best("matmul.log"):
	with tvm.target.Target("llvm"):
	s, arg_bufs = matmul(N, L, M, "float32")
	func = tvm.build(s, arg_bufs)

	# check correctness
	a_np = np.random.uniform(size=(N, L)).astype(np.float32)
	b_np = np.random.uniform(size=(L, M)).astype(np.float32)
	c_np = a_np.dot(b_np)

	c_tvm = tvm.nd.empty(c_np.shape)
	func(tvm.nd.array(a_np), tvm.nd.array(b_np), c_tvm)

	tvm.testing.assert_allclose(c_np, c_tvm.numpy(), rtol=1e-4)





	.. rst-class:: sphx-glr-script-out

	.. code-block:: none

	Finish loading 10 records




	.. GENERATED FROM PYTHON SOURCE LINES 369-379

	Final Notes and Summary
	-----------------------
	In this tutorial, we have shown how to build operator templates that allow
	TVM to search a parameter space and choose optimized schedule configurations.
	To gain a deeper understanding of how this works, we recommend expanding on
	this example by adding new search parameters to the schedule based on
	schedule operations demonstrated in the :ref: `Getting Started With Tensor
	Expressions <tensor_expr_get_started>_` tutorial. In the upcoming sections, we
	will demonstrate the AutoScheduler, a method for TVM to optimize common
	operators without the need for the user to provide a user-defined template.


	.. _sphx_glr_download_tutorial_autotvm_matmul_x86.py:

	.. only:: html

	.. container:: sphx-glr-footer sphx-glr-footer-example


	.. container:: sphx-glr-download sphx-glr-download-python

	:download:`Download Python source code: autotvm_matmul_x86.py <autotvm_matmul_x86.py>`

	.. container:: sphx-glr-download sphx-glr-download-jupyter

	:download:`Download Jupyter notebook: autotvm_matmul_x86.ipynb <autotvm_matmul_x86.ipynb>`


	.. only:: html

	.. rst-class:: sphx-glr-signature

	`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_