docs/_sources/tutorial/auto_scheduler_matmul_x86.rst.txt - tvm-site - Git at Google

 .. note::
     :class: sphx-glr-download-link-note

     Click :ref:`here <sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py>` to download the full example code
 .. rst-class:: sphx-glr-example-title

 .. _sphx_glr_tutorial_auto_scheduler_matmul_x86.py:


 Optimizing Operators with Auto-scheduling
 =========================================
 **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_,             `Chengfan Jia <https://github.com/jcf94/>`_

 In this tutorial, we will show how TVM's Auto Scheduling feature can find
 optimal schedules without the need for writing a custom template.

 Different from the template-based :doc:`AutoTVM <autotvm_matmul_x86>` which relies on
 manual templates to define the search space, the auto-scheduler does not
 require any templates.  Users only need to write the computation declaration
 without any schedule commands or templates.  The auto-scheduler can
 automatically generate a large search space and find a good schedule in the
 space.

 We use matrix multiplication as an example in this tutorial.

 .. note::
   Note that this tutorial will not run on Windows or recent versions of macOS. To
   get it to run, you will need to wrap the body of this tutorial in a :code:`if
   __name__ == "__main__":` block.


 .. code-block:: default


     import os

     import numpy as np
     import tvm
     from tvm import te, auto_scheduler


 Defining the Matrix Multiplication
 ----------------------------------
 To start, we define a matrix multiplication with a bias addition.  Note that
 this uses standard operations available in TVMs Tensor Expression language.
 The major difference is the use of the :any:`register_workload` decorator at the top
 of the function definition.  The function should return a list of
 input/output tensors.  From these tensors, the auto-scheduler can get the
 whole computational graph.


 .. code-block:: default


     @auto_scheduler.register_workload  # Note the auto_scheduler decorator
     def matmul_add(N, L, M, dtype):
         A = te.placeholder((N, L), name="A", dtype=dtype)
         B = te.placeholder((L, M), name="B", dtype=dtype)
         C = te.placeholder((N, M), name="C", dtype=dtype)

         k = te.reduce_axis((0, L), name="k")
         matmul = te.compute(
             (N, M),
             lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
             name="matmul",
             attrs={"layout_free_placeholders": [B]},  # enable automatic layout transform for tensor B
         )
         out = te.compute((N, M), lambda i, j: matmul[i, j] + C[i, j], name="out")

         return [A, B, C, out]


 Create the search task
 ----------------------
 With the function defined, we can now create the task for the auto_scheduler
 to search against. We specify the particular parameters for this matrix
 multiplication, in this case a multiplication of to square matricies of size
 1024x1024. We then create a search task with N=L=M=1024 and dtype="float32"

 .. admonition:: Improve performance with custom targets

   In order for TVM to take full advantage of specific hardware platforms,
   you will want to manuall specify your CPU capabilities. For example:

     - replace ``llvm`` below with ``llvm -mcpu=core-avx2`` to enable AVX2
     - replace ``llvm`` below with ``llvm -mcpu=skylake-avx512`` to enable AVX-512


 .. code-block:: default


     target = tvm.target.Target("llvm")
     N = L = M = 1024
     task = tvm.auto_scheduler.SearchTask(func=matmul_add, args=(N, L, M, "float32"), target=target)

     # Inspect the computational graph
     print("Computational DAG:")
     print(task.compute_dag)


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none

     Computational DAG:
     A = PLACEHOLDER [1024, 1024]
     B = PLACEHOLDER [1024, 1024]
     matmul(i, j) += (A[i, k]*B[k, j])
     C = PLACEHOLDER [1024, 1024]
     out(i, j) = (matmul[i, j] + C[i, j])


 Set Parameters for Auto-Scheduler
 ---------------------------------
 Next, we set parameters for the auto-scheduler.

 * :code:`num_measure_trials` is the number of measurement trials we can use
   during the search.  We only make 10 trials in this tutorial for a fast
   demonstration. In practice, 1000 is a good value for the search to converge.
   You can do more trials according to your time budget.
 * In addition, we use :any:`RecordToFile <auto_scheduler.RecordToFile>` to log measurement records into a
   file ``matmul.json``.  The measurement records can be used to query the history
   best, resume the search, and do more analyses later.
 * see :any:`TuningOptions <auto_scheduler.TuningOptions>` for more parameters


 .. code-block:: default


     log_file = "matmul.json"
     tune_option = auto_scheduler.TuningOptions(
         num_measure_trials=10,
         measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
         verbose=2,
     )


 Run the search
 --------------
 Now we get all inputs ready. Pretty simple, isn't it?  We can kick off the
 search and let the auto-scheduler do its magic.  After some measurement
 trials, we can load the best schedule from the log file and apply it.


 .. code-block:: default


     # Run auto-tuning (search)
     task.tune(tune_option)
     # Apply the best schedule
     sch, args = task.apply_best(log_file)


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none


 Inspecting the Optimized Schedule
 ---------------------------------
 We can lower the schedule to see the IR after auto-scheduling.  The
 auto-scheduler correctly performs optimizations including multi-level tiling,
 layout transformation, parallelization, vectorization, unrolling, and
 operator fusion.


 .. code-block:: default


     print("Lowered TIR:")
     print(tvm.lower(sch, args, simple_mode=True))


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none

     Lowered TIR:
     @main = primfn(A_1: handle, B_1: handle, C_1: handle, out_1: handle) -> ()
       attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
       buffers = {C: Buffer(C_2: Pointer(float32), float32, [1048576], []),
                  A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
                  out: Buffer(out_2: Pointer(float32), float32, [1048576], []),
                  B: Buffer(B_2: Pointer(float32), float32, [1048576], [])}
       buffer_map = {A_1: A, B_1: B, C_1: C, out_1: out} {
       allocate(auto_scheduler_layout_transform: Pointer(global float32), float32, [1048576]), storage_scope = global {
         for (ax0.ax1.fused.ax2.fused: int32, 0, 128) "parallel" {
           for (ax4: int32, 0, 256) {
             for (ax6: int32, 0, 4) {
               for (ax7: int32, 0, 8) {
                 auto_scheduler_layout_transform_1: Buffer(auto_scheduler_layout_transform, float32, [1048576], [])[((((ax0.ax1.fused.ax2.fused*8192) + (ax4*32)) + (ax6*8)) + ax7)] = B[((((ax4*4096) + (ax6*1024)) + (ax0.ax1.fused.ax2.fused*8)) + ax7)]
               }
             }
           }
         }
         for (i.outer.outer.j.outer.outer.fused: int32, 0, 16384) "parallel" {
           allocate(matmul: Pointer(global float32x8), float32x8, [4]), storage_scope = global;
           for (i.outer.inner: int32, 0, 2) {
             matmul_1: Buffer(matmul, float32x8, [4], [])[0] = broadcast(0f32, 8)
             matmul_1[1] = broadcast(0f32, 8)
             matmul_1[2] = broadcast(0f32, 8)
             matmul_1[3] = broadcast(0f32, 8)
             for (k.outer: int32, 0, 256) {
               for (k.inner: int32, 0, 4) {
                 let cse_var_2: int32 = (((floormod(i.outer.outer.j.outer.outer.fused, 128)*8192) + (k.outer*32)) + (k.inner*8))
                 let cse_var_1: int32 = ((((floordiv(i.outer.outer.j.outer.outer.fused, 128)*8192) + (i.outer.inner*4096)) + (k.outer*4)) + k.inner)
                  {
                   matmul_1[0] = (matmul_1[0] + (broadcast(A[cse_var_1], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
                   matmul_1[1] = (matmul_1[1] + (broadcast(A[(cse_var_1 + 1024)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
                   matmul_1[2] = (matmul_1[2] + (broadcast(A[(cse_var_1 + 2048)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
                   matmul_1[3] = (matmul_1[3] + (broadcast(A[(cse_var_1 + 3072)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
                 }
               }
             }
             for (i.inner: int32, 0, 4) {
               let cse_var_3: int32 = ((((floordiv(i.outer.outer.j.outer.outer.fused, 128)*8192) + (i.outer.inner*4096)) + (i.inner*1024)) + (floormod(i.outer.outer.j.outer.outer.fused, 128)*8))
               out[ramp(cse_var_3, 1, 8)] = (matmul_1[i.inner] + C[ramp(cse_var_3, 1, 8)])
             }
           }
         }
       }
     }


 Check correctness and evaluate performance
 ------------------------------------------
 We build the binary and check its correctness and performance.


 .. code-block:: default


     func = tvm.build(sch, args, target)
     a_np = np.random.uniform(size=(N, L)).astype(np.float32)
     b_np = np.random.uniform(size=(L, M)).astype(np.float32)
     c_np = np.random.uniform(size=(N, M)).astype(np.float32)
     out_np = a_np.dot(b_np) + c_np

     dev = tvm.cpu()
     a_tvm = tvm.nd.array(a_np, device=dev)
     b_tvm = tvm.nd.array(b_np, device=dev)
     c_tvm = tvm.nd.array(c_np, device=dev)
     out_tvm = tvm.nd.empty(out_np.shape, device=dev)
     func(a_tvm, b_tvm, c_tvm, out_tvm)

     # Check results
     np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)

     # Evaluate execution time.
     evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
     print(
         "Execution time of this operator: %.3f ms"
         % (np.median(evaluator(a_tvm, b_tvm, c_tvm, out_tvm).results) * 1000)
     )


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none

     Execution time of this operator: 94.252 ms


 Using the record file
 ---------------------
 During the search, all measurement records are logged into the record file
 ``matmul.json```. The measurement records can be used to re-apply search
 results, resume the search, and perform other analyses.

 Here is an example where we load the best schedule from a file, and print the
 equivalent python schedule API. This can be used for debugging and learning
 the behavior of the auto-scheduler.


 .. code-block:: default


     print("Equivalent python schedule:")
     print(task.print_best(log_file))


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none

     Equivalent python schedule:
     matmul_i, matmul_j, matmul_k = tuple(matmul.op.axis) + tuple(matmul.op.reduce_axis)
     out_i, out_j = tuple(out.op.axis) + tuple(out.op.reduce_axis)
     matmul_i_o_i, matmul_i_i = s[matmul].split(matmul_i, factor=4)
     matmul_i_o_o_i, matmul_i_o_i = s[matmul].split(matmul_i_o_i, factor=1)
     matmul_i_o_o_o, matmul_i_o_o_i = s[matmul].split(matmul_i_o_o_i, factor=2)
     matmul_j_o_i, matmul_j_i = s[matmul].split(matmul_j, factor=8)
     matmul_j_o_o_i, matmul_j_o_i = s[matmul].split(matmul_j_o_i, factor=1)
     matmul_j_o_o_o, matmul_j_o_o_i = s[matmul].split(matmul_j_o_o_i, factor=1)
     matmul_k_o, matmul_k_i = s[matmul].split(matmul_k, factor=4)
     s[matmul].reorder(matmul_i_o_o_o, matmul_j_o_o_o, matmul_i_o_o_i, matmul_j_o_o_i, matmul_k_o, matmul_i_o_i, matmul_j_o_i, matmul_k_i, matmul_i_i, matmul_j_i)
     out_i_o_i, out_i_i = s[out].split(out_i, factor=4)
     out_i_o_o, out_i_o_i = s[out].split(out_i_o_i, factor=2)
     out_j_o_i, out_j_i = s[out].split(out_j, factor=8)
     out_j_o_o, out_j_o_i = s[out].split(out_j_o_i, factor=1)
     s[out].reorder(out_i_o_o, out_j_o_o, out_i_o_i, out_j_o_i, out_i_i, out_j_i)
     s[matmul].compute_at(s[out], out_j_o_i)
     out_i_o_o_j_o_o_fused = s[out].fuse(out_i_o_o, out_j_o_o)
     s[out].parallel(out_i_o_o_j_o_o_fused)
     s[matmul].pragma(matmul_i_o_o_o, "auto_unroll_max_step", 8)
     s[matmul].pragma(matmul_i_o_o_o, "unroll_explicit", True)
     s[matmul].vectorize(matmul_j_i)
     s[out].vectorize(out_j_i)


 A more complicated example is to resume the search.  In this case, we need to
 create the search policy and cost model by ourselves and resume the status of
 search policy and cost model with the log file.  In the example below we
 resume the status and do more 5 trials.


 .. code-block:: default


     def resume_search(task, log_file):
         print("Resume search:")
         cost_model = auto_scheduler.XGBModel()
         cost_model.update_from_file(log_file)
         search_policy = auto_scheduler.SketchPolicy(
             task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
         )
         tune_option = auto_scheduler.TuningOptions(
             num_measure_trials=5, measure_callbacks=[auto_scheduler.RecordToFile(log_file)]
         )
         task.tune(tune_option, search_policy=search_policy)


     resume_search(task, log_file)


 .. rst-class:: sphx-glr-script-out

  Out:

  .. code-block:: none

     Resume search:
     /usr/local/lib/python3.6/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated.  See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
       warnings.warn(f'Old style callback is deprecated.  See: {link}', UserWarning)


 Final Notes and Summary
 -----------------------
 In this tutorial, we have shown how to use the TVM Auto-Scheduler to
 automatically optimize a matrix multiplication, without the need to specify a
 search template.  It ends a series of examples that starts from the Tensor
 Expression (TE) language that demonstrates how TVM can optimize computational
 operations.


 .. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:


 .. only :: html

  .. container:: sphx-glr-footer
     :class: sphx-glr-footer-example


   .. container:: sphx-glr-download

      :download:`Download Python source code: auto_scheduler_matmul_x86.py <auto_scheduler_matmul_x86.py>`


   .. container:: sphx-glr-download

      :download:`Download Jupyter notebook: auto_scheduler_matmul_x86.ipynb <auto_scheduler_matmul_x86.ipynb>`


 .. only:: html

  .. rst-class:: sphx-glr-signature

     `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_
	.. note::
	:class: sphx-glr-download-link-note

	Click :ref:`here <sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py>` to download the full example code
	.. rst-class:: sphx-glr-example-title

	.. _sphx_glr_tutorial_auto_scheduler_matmul_x86.py:


	Optimizing Operators with Auto-scheduling
	=========================================
	Author: `Lianmin Zheng <https://github.com/merrymercy>`_, `Chengfan Jia <https://github.com/jcf94/>`_

	In this tutorial, we will show how TVM's Auto Scheduling feature can find
	optimal schedules without the need for writing a custom template.

	Different from the template-based :doc:`AutoTVM <autotvm_matmul_x86>` which relies on
	manual templates to define the search space, the auto-scheduler does not
	require any templates. Users only need to write the computation declaration
	without any schedule commands or templates. The auto-scheduler can
	automatically generate a large search space and find a good schedule in the
	space.

	We use matrix multiplication as an example in this tutorial.

	.. note::
	Note that this tutorial will not run on Windows or recent versions of macOS. To
	get it to run, you will need to wrap the body of this tutorial in a :code:`if
	__name__ == "__main__":` block.


	.. code-block:: default


	import os

	import numpy as np
	import tvm
	from tvm import te, auto_scheduler







	Defining the Matrix Multiplication
	----------------------------------
	To start, we define a matrix multiplication with a bias addition. Note that
	this uses standard operations available in TVMs Tensor Expression language.
	The major difference is the use of the :any:`register_workload` decorator at the top
	of the function definition. The function should return a list of
	input/output tensors. From these tensors, the auto-scheduler can get the
	whole computational graph.


	.. code-block:: default



	@auto_scheduler.register_workload # Note the auto_scheduler decorator
	def matmul_add(N, L, M, dtype):
	A = te.placeholder((N, L), name="A", dtype=dtype)
	B = te.placeholder((L, M), name="B", dtype=dtype)
	C = te.placeholder((N, M), name="C", dtype=dtype)

	k = te.reduce_axis((0, L), name="k")
	matmul = te.compute(
	(N, M),
	lambda i, j: te.sum(A[i, k] * B[k, j], axis=k),
	name="matmul",
	attrs={"layout_free_placeholders": [B]}, # enable automatic layout transform for tensor B
	)
	out = te.compute((N, M), lambda i, j: matmul[i, j] + C[i, j], name="out")

	return [A, B, C, out]








	Create the search task
	----------------------
	With the function defined, we can now create the task for the auto_scheduler
	to search against. We specify the particular parameters for this matrix
	multiplication, in this case a multiplication of to square matricies of size
	1024x1024. We then create a search task with N=L=M=1024 and dtype="float32"

	.. admonition:: Improve performance with custom targets

	In order for TVM to take full advantage of specific hardware platforms,
	you will want to manuall specify your CPU capabilities. For example:

	- replace ``llvm`` below with ``llvm -mcpu=core-avx2`` to enable AVX2
	- replace ``llvm`` below with ``llvm -mcpu=skylake-avx512`` to enable AVX-512


	.. code-block:: default


	target = tvm.target.Target("llvm")
	N = L = M = 1024
	task = tvm.auto_scheduler.SearchTask(func=matmul_add, args=(N, L, M, "float32"), target=target)

	# Inspect the computational graph
	print("Computational DAG:")
	print(task.compute_dag)





	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none

	Computational DAG:
	A = PLACEHOLDER [1024, 1024]
	B = PLACEHOLDER [1024, 1024]
	matmul(i, j) += (A[i, k]*B[k, j])
	C = PLACEHOLDER [1024, 1024]
	out(i, j) = (matmul[i, j] + C[i, j])




	Set Parameters for Auto-Scheduler
	---------------------------------
	Next, we set parameters for the auto-scheduler.

	* :code:`num_measure_trials` is the number of measurement trials we can use
	during the search. We only make 10 trials in this tutorial for a fast
	demonstration. In practice, 1000 is a good value for the search to converge.
	You can do more trials according to your time budget.
	* In addition, we use :any:`RecordToFile <auto_scheduler.RecordToFile>` to log measurement records into a
	file ``matmul.json``. The measurement records can be used to query the history
	best, resume the search, and do more analyses later.
	* see :any:`TuningOptions <auto_scheduler.TuningOptions>` for more parameters


	.. code-block:: default


	log_file = "matmul.json"
	tune_option = auto_scheduler.TuningOptions(
	num_measure_trials=10,
	measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
	verbose=2,
	)







	Run the search
	--------------
	Now we get all inputs ready. Pretty simple, isn't it? We can kick off the
	search and let the auto-scheduler do its magic. After some measurement
	trials, we can load the best schedule from the log file and apply it.


	.. code-block:: default


	# Run auto-tuning (search)
	task.tune(tune_option)
	# Apply the best schedule
	sch, args = task.apply_best(log_file)





	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none






	Inspecting the Optimized Schedule
	---------------------------------
	We can lower the schedule to see the IR after auto-scheduling. The
	auto-scheduler correctly performs optimizations including multi-level tiling,
	layout transformation, parallelization, vectorization, unrolling, and
	operator fusion.


	.. code-block:: default


	print("Lowered TIR:")
	print(tvm.lower(sch, args, simple_mode=True))





	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none

	Lowered TIR:
	@main = primfn(A_1: handle, B_1: handle, C_1: handle, out_1: handle) -> ()
	attr = {"from_legacy_te_schedule": True, "global_symbol": "main", "tir.noalias": True}
	buffers = {C: Buffer(C_2: Pointer(float32), float32, [1048576], []),
	A: Buffer(A_2: Pointer(float32), float32, [1048576], []),
	out: Buffer(out_2: Pointer(float32), float32, [1048576], []),
	B: Buffer(B_2: Pointer(float32), float32, [1048576], [])}
	buffer_map = {A_1: A, B_1: B, C_1: C, out_1: out} {
	allocate(auto_scheduler_layout_transform: Pointer(global float32), float32, [1048576]), storage_scope = global {
	for (ax0.ax1.fused.ax2.fused: int32, 0, 128) "parallel" {
	for (ax4: int32, 0, 256) {
	for (ax6: int32, 0, 4) {
	for (ax7: int32, 0, 8) {
	auto_scheduler_layout_transform_1: Buffer(auto_scheduler_layout_transform, float32, [1048576], [])[((((ax0.ax1.fused.ax2.fused8192) + (ax432)) + (ax68)) + ax7)] = B[((((ax44096) + (ax61024)) + (ax0.ax1.fused.ax2.fused8)) + ax7)]
	}
	}
	}
	}
	for (i.outer.outer.j.outer.outer.fused: int32, 0, 16384) "parallel" {
	allocate(matmul: Pointer(global float32x8), float32x8, [4]), storage_scope = global;
	for (i.outer.inner: int32, 0, 2) {
	matmul_1: Buffer(matmul, float32x8, [4], [])[0] = broadcast(0f32, 8)
	matmul_1[1] = broadcast(0f32, 8)
	matmul_1[2] = broadcast(0f32, 8)
	matmul_1[3] = broadcast(0f32, 8)
	for (k.outer: int32, 0, 256) {
	for (k.inner: int32, 0, 4) {
	let cse_var_2: int32 = (((floormod(i.outer.outer.j.outer.outer.fused, 128)8192) + (k.outer32)) + (k.inner*8))
	let cse_var_1: int32 = ((((floordiv(i.outer.outer.j.outer.outer.fused, 128)8192) + (i.outer.inner4096)) + (k.outer*4)) + k.inner)
	{
	matmul_1[0] = (matmul_1[0] + (broadcast(A[cse_var_1], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
	matmul_1[1] = (matmul_1[1] + (broadcast(A[(cse_var_1 + 1024)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
	matmul_1[2] = (matmul_1[2] + (broadcast(A[(cse_var_1 + 2048)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
	matmul_1[3] = (matmul_1[3] + (broadcast(A[(cse_var_1 + 3072)], 8)*auto_scheduler_layout_transform_1[ramp(cse_var_2, 1, 8)]))
	}
	}
	}
	for (i.inner: int32, 0, 4) {
	let cse_var_3: int32 = ((((floordiv(i.outer.outer.j.outer.outer.fused, 128)8192) + (i.outer.inner4096)) + (i.inner1024)) + (floormod(i.outer.outer.j.outer.outer.fused, 128)8))
	out[ramp(cse_var_3, 1, 8)] = (matmul_1[i.inner] + C[ramp(cse_var_3, 1, 8)])
	}
	}
	}
	}
	}





	Check correctness and evaluate performance
	------------------------------------------
	We build the binary and check its correctness and performance.


	.. code-block:: default


	func = tvm.build(sch, args, target)
	a_np = np.random.uniform(size=(N, L)).astype(np.float32)
	b_np = np.random.uniform(size=(L, M)).astype(np.float32)
	c_np = np.random.uniform(size=(N, M)).astype(np.float32)
	out_np = a_np.dot(b_np) + c_np

	dev = tvm.cpu()
	a_tvm = tvm.nd.array(a_np, device=dev)
	b_tvm = tvm.nd.array(b_np, device=dev)
	c_tvm = tvm.nd.array(c_np, device=dev)
	out_tvm = tvm.nd.empty(out_np.shape, device=dev)
	func(a_tvm, b_tvm, c_tvm, out_tvm)

	# Check results
	np.testing.assert_allclose(out_np, out_tvm.numpy(), rtol=1e-3)

	# Evaluate execution time.
	evaluator = func.time_evaluator(func.entry_name, dev, min_repeat_ms=500)
	print(
	"Execution time of this operator: %.3f ms"
	% (np.median(evaluator(a_tvm, b_tvm, c_tvm, out_tvm).results) * 1000)
	)






	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none

	Execution time of this operator: 94.252 ms



	Using the record file
	---------------------
	During the search, all measurement records are logged into the record file
	``matmul.json```. The measurement records can be used to re-apply search
	results, resume the search, and perform other analyses.

	Here is an example where we load the best schedule from a file, and print the
	equivalent python schedule API. This can be used for debugging and learning
	the behavior of the auto-scheduler.


	.. code-block:: default


	print("Equivalent python schedule:")
	print(task.print_best(log_file))





	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none

	Equivalent python schedule:
	matmul_i, matmul_j, matmul_k = tuple(matmul.op.axis) + tuple(matmul.op.reduce_axis)
	out_i, out_j = tuple(out.op.axis) + tuple(out.op.reduce_axis)
	matmul_i_o_i, matmul_i_i = s[matmul].split(matmul_i, factor=4)
	matmul_i_o_o_i, matmul_i_o_i = s[matmul].split(matmul_i_o_i, factor=1)
	matmul_i_o_o_o, matmul_i_o_o_i = s[matmul].split(matmul_i_o_o_i, factor=2)
	matmul_j_o_i, matmul_j_i = s[matmul].split(matmul_j, factor=8)
	matmul_j_o_o_i, matmul_j_o_i = s[matmul].split(matmul_j_o_i, factor=1)
	matmul_j_o_o_o, matmul_j_o_o_i = s[matmul].split(matmul_j_o_o_i, factor=1)
	matmul_k_o, matmul_k_i = s[matmul].split(matmul_k, factor=4)
	s[matmul].reorder(matmul_i_o_o_o, matmul_j_o_o_o, matmul_i_o_o_i, matmul_j_o_o_i, matmul_k_o, matmul_i_o_i, matmul_j_o_i, matmul_k_i, matmul_i_i, matmul_j_i)
	out_i_o_i, out_i_i = s[out].split(out_i, factor=4)
	out_i_o_o, out_i_o_i = s[out].split(out_i_o_i, factor=2)
	out_j_o_i, out_j_i = s[out].split(out_j, factor=8)
	out_j_o_o, out_j_o_i = s[out].split(out_j_o_i, factor=1)
	s[out].reorder(out_i_o_o, out_j_o_o, out_i_o_i, out_j_o_i, out_i_i, out_j_i)
	s[matmul].compute_at(s[out], out_j_o_i)
	out_i_o_o_j_o_o_fused = s[out].fuse(out_i_o_o, out_j_o_o)
	s[out].parallel(out_i_o_o_j_o_o_fused)
	s[matmul].pragma(matmul_i_o_o_o, "auto_unroll_max_step", 8)
	s[matmul].pragma(matmul_i_o_o_o, "unroll_explicit", True)
	s[matmul].vectorize(matmul_j_i)
	s[out].vectorize(out_j_i)




	A more complicated example is to resume the search. In this case, we need to
	create the search policy and cost model by ourselves and resume the status of
	search policy and cost model with the log file. In the example below we
	resume the status and do more 5 trials.


	.. code-block:: default



	def resume_search(task, log_file):
	print("Resume search:")
	cost_model = auto_scheduler.XGBModel()
	cost_model.update_from_file(log_file)
	search_policy = auto_scheduler.SketchPolicy(
	task, cost_model, init_search_callbacks=[auto_scheduler.PreloadMeasuredStates(log_file)]
	)
	tune_option = auto_scheduler.TuningOptions(
	num_measure_trials=5, measure_callbacks=[auto_scheduler.RecordToFile(log_file)]
	)
	task.tune(tune_option, search_policy=search_policy)


	resume_search(task, log_file)





	.. rst-class:: sphx-glr-script-out

	Out:

	.. code-block:: none

	Resume search:
	/usr/local/lib/python3.6/dist-packages/xgboost/training.py:17: UserWarning: Old style callback is deprecated. See: https://xgboost.readthedocs.io/en/latest/python/callbacks.html
	warnings.warn(f'Old style callback is deprecated. See: {link}', UserWarning)




	Final Notes and Summary
	-----------------------
	In this tutorial, we have shown how to use the TVM Auto-Scheduler to
	automatically optimize a matrix multiplication, without the need to specify a
	search template. It ends a series of examples that starts from the Tensor
	Expression (TE) language that demonstrates how TVM can optimize computational
	operations.


	.. _sphx_glr_download_tutorial_auto_scheduler_matmul_x86.py:


	.. only :: html

	.. container:: sphx-glr-footer
	:class: sphx-glr-footer-example



	.. container:: sphx-glr-download

	:download:`Download Python source code: auto_scheduler_matmul_x86.py <auto_scheduler_matmul_x86.py>`



	.. container:: sphx-glr-download

	:download:`Download Jupyter notebook: auto_scheduler_matmul_x86.ipynb <auto_scheduler_matmul_x86.ipynb>`


	.. only:: html

	.. rst-class:: sphx-glr-signature

	`Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_