| |
| .. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY |
| .. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE |
| .. CHANGES, EDIT THE SOURCE PYTHON FILE: |
| .. "how_to/optimize_operators/opt_gemm.py" |
| |
| .. only:: html |
| |
| .. note:: |
| :class: sphx-glr-download-link-note |
| |
| This tutorial can be used interactively with Google Colab! You can also click |
| :ref:`here <sphx_glr_download_how_to_optimize_operators_opt_gemm.py>` to run the Jupyter notebook locally. |
| |
| .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg |
| :align: center |
| :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/0f8d36b3ffd04a5a08089dc671eb788e/opt_gemm.ipynb |
| :width: 300px |
| |
| .. rst-class:: sphx-glr-example-title |
| |
| .. _sphx_glr_how_to_optimize_operators_opt_gemm.py: |
| |
| |
| .. _opt-gemm: |
| |
| How to optimize GEMM on CPU |
| =========================== |
| **Author**: `Jian Weng <https://github.com/were>`_, `Ruofei Yu <https://github.com/yuruofeifei>`_ |
| |
| (TL;DR) TVM provides abstract interfaces which allows users to depict an algorithm and the |
| algorithm's implementing organization (the so-called schedule) separately. Typically, writing |
| algorithm in high-performance schedule breaks the algorithm's readability and modularity. Also, |
| trying various seemingly promising schedules is time-consuming. With the help of TVM, we can |
| try these schedules efficiently to enhance the performance. |
| |
| In this tutorial, we will demonstrate how to use TVM to optimize square matrix multiplication |
| and achieve 200 times faster than baseline by simply adding 18 extra lines of code. |
| |
| There are two important optimizations on intense computation applications executed on CPU: |
| 1. Increase the cache hit rate of memory access. Both complex numerical computation and hot-spot |
| memory access can be accelerated from high cache hit rate. This requires us to transform the |
| origin memory access pattern to the pattern fits the cache policy. |
| 2. SIMD (Single instruction multi-data), or we call it vector processing unit. Every time, a |
| small batch of data, rather than a single grid, will be processed. This requires us to |
| transform the data access pattern in the loop body in uniform pattern so that the LLVM |
| backend can lower it to SIMD. |
| |
| Actually, all the methodologies used in this tutorial is a subset of tricks mentioned in this |
| `repo <https://github.com/flame/how-to-optimize-gemm>`_. Some of them have been applied by TVM |
| abstraction automatically, but some of them cannot be simply applied due to TVM constraints. |
| |
| All the experiment results mentioned below, are executed on 2015's 15' MacBook equipped with |
| Intel i7-4770HQ CPU. The cache line size should be 64 bytes for all the x86 CPUs. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 53-58 |
| |
| Preparation and Baseline |
| ------------------------ |
| In this tutorial, we will demo how to use TVM to optimize matrix multiplication. |
| Before actually demonstrating, we first define these variables. |
| Then we write a baseline implementation, the simplest way to write a matrix multiplication in TVM. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 58-119 |
| |
| .. code-block:: default |
| |
| |
| import tvm |
| import tvm.testing |
| from tvm import te |
| import numpy |
| import timeit |
| |
| # The size of the matrix |
| # (M, K) x (K, N) |
| # You are free to try out different shapes, sometimes TVM optimization outperforms numpy with MKL. |
| M = 1024 |
| K = 1024 |
| N = 1024 |
| |
| # The default tensor type in tvm |
| dtype = "float32" |
| |
| # using Intel AVX2(Advanced Vector Extensions) ISA for SIMD |
| # To get the best performance, please change the following line |
| # to llvm -mcpu=core-avx2, or specific type of CPU you use |
| target = "llvm" |
| dev = tvm.device(target, 0) |
| |
| # Random generated tensor for testing |
| a = tvm.nd.array(numpy.random.rand(M, K).astype(dtype), dev) |
| b = tvm.nd.array(numpy.random.rand(K, N).astype(dtype), dev) |
| |
| np_repeat = 100 |
| np_runing_time = timeit.timeit( |
| setup="import numpy\n" |
| "M = " + str(M) + "\n" |
| "K = " + str(K) + "\n" |
| "N = " + str(N) + "\n" |
| 'dtype = "float32"\n' |
| "a = numpy.random.rand(M, K).astype(dtype)\n" |
| "b = numpy.random.rand(K, N).astype(dtype)\n", |
| stmt="answer = numpy.dot(a, b)", |
| number=np_repeat, |
| ) |
| print("Numpy running time: %f" % (np_runing_time / np_repeat)) |
| |
| answer = numpy.dot(a.numpy(), b.numpy()) |
| |
| # Algorithm |
| k = te.reduce_axis((0, K), "k") |
| A = te.placeholder((M, K), name="A") |
| B = te.placeholder((K, N), name="B") |
| C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C") |
| |
| # Default schedule |
| s = te.create_schedule(C.op) |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=1) |
| print("Baseline: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Numpy running time: 0.019075 |
| Baseline: 3.333430 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 120-122 |
| |
| In TVM, we can always inspect lower level IR to debug or optimize our schedule. |
| Here is the generated IR using our baseline schedule. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 122-125 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for m, n in T.grid(1024, 1024): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[m * 1024 + n] = T.float32(0) |
| for k in range(1024): |
| cse_var_2: T.int32 = m * 1024 |
| cse_var_1: T.int32 = cse_var_2 + n |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1] = C_1[cse_var_1] + A_1[cse_var_2 + k] * B_1[k * 1024 + n] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 126-132 |
| |
| Blocking |
| -------- |
| A important trick to enhance the cache hit rate is blocking --- data chunk will be computed |
| block by block. The memory access inside the block is a small neighbourhood which is with high |
| memory locality. In this tutorial, I picked up 32 as the blocking factor. So the block will |
| fill 32 * 32 * sizeof(float) which is 4KB in the cache whose total size is 32KB (L1 data cache) |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 132-157 |
| |
| .. code-block:: default |
| |
| |
| bn = 32 |
| kfactor = 4 |
| s = te.create_schedule(C.op) |
| |
| # Blocking by loop tiling |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (kaxis,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(kaxis, factor=kfactor) |
| |
| # Hoist reduction domain outside the blocking loop |
| s[C].reorder(mo, no, ko, ki, mi, ni) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| # By simply tiling the loop 32x32, and hoisting ko, ki outside the blocking loops, |
| # we can see big speedup compared with the baseline. |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| print("Opt1: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt1: 0.325355 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 158-159 |
| |
| Here is the generated IR after blocking. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 159-162 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for m_outer, n_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for m_inner_init, n_inner_init in T.grid(32, 32): |
| C_1[m_outer * 32768 + m_inner_init * 1024 + n_outer * 32 + n_inner_init] = T.float32(0) |
| for k_outer, k_inner, m_inner, n_inner in T.grid(256, 4, 32, 32): |
| cse_var_3: T.int32 = n_outer * 32 |
| cse_var_2: T.int32 = m_outer * 32768 + m_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 + n_inner |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1] = C_1[cse_var_1] + A_1[cse_var_2 + k_outer * 4 + k_inner] * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3 + n_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 163-171 |
| |
| Vectorization |
| ------------- |
| Another important trick is vectorization. When the memory access pattern is uniform, |
| the compiler can detect this pattern and pass the continuous memory to vector processor. In TVM, |
| we can use `vectorize` interface to hint the compiler this pattern, so that we can accelerate it |
| vastly. |
| |
| In this tutorial, we chose to vectorize the inner loop row data since it is cache friendly. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 171-192 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (kaxis,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(kaxis, factor=kfactor) |
| |
| s[C].reorder(mo, no, ko, ki, mi, ni) |
| |
| # Vectorization |
| s[C].vectorize(ni) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| print("Opt2: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt2: 0.318767 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 193-194 |
| |
| Here is the generated IR after vectorization. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 194-197 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for m_outer, n_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for m_inner_init in range(32): |
| C_1[m_outer * 32768 + m_inner_init * 1024 + n_outer * 32:m_outer * 32768 + m_inner_init * 1024 + n_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, k_inner, m_inner in T.grid(256, 4, 32): |
| cse_var_3: T.int32 = n_outer * 32 |
| cse_var_2: T.int32 = m_outer * 32768 + m_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_2 + k_outer * 4 + k_inner], 32) * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3:k_outer * 4096 + k_inner * 1024 + cse_var_3 + 32] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 198-204 |
| |
| Loop Permutation |
| ---------------- |
| If we look at the above IR, we can see the inner loop row data is vectorized for both B and C. |
| Next we will look at the access pattern of A. In current schedule, A is accessed column by column |
| which is not cache friendly. If we change the nested loop order of ki and inner axes mi, |
| the access pattern for A matrix is more cache friendly. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 204-224 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (kaxis,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(kaxis, factor=kfactor) |
| |
| # re-ordering |
| s[C].reorder(mo, no, ko, mi, ki, ni) |
| s[C].vectorize(ni) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| print("Opt3: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt3: 0.122557 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 225-226 |
| |
| Here is the generated IR after loop permutation. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 226-229 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| for m_outer, n_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for m_inner_init in range(32): |
| C_1[m_outer * 32768 + m_inner_init * 1024 + n_outer * 32:m_outer * 32768 + m_inner_init * 1024 + n_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, m_inner, k_inner in T.grid(256, 32, 4): |
| cse_var_3: T.int32 = n_outer * 32 |
| cse_var_2: T.int32 = m_outer * 32768 + m_inner * 1024 |
| cse_var_1: T.int32 = cse_var_2 + cse_var_3 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| B_1 = T.Buffer((1048576,), data=B.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_2 + k_outer * 4 + k_inner], 32) * B_1[k_outer * 4096 + k_inner * 1024 + cse_var_3:k_outer * 4096 + k_inner * 1024 + cse_var_3 + 32] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 230-240 |
| |
| Array Packing |
| ------------- |
| Another important trick is array packing. The trick is to reorder the storage of a multi- |
| dimensional array so that it is accessed sequentially after it is flattened and stored in one- |
| dimensional memory. |
| |
| .. image:: https://github.com/dmlc/web-data/raw/main/tvm/tutorial/array-packing.png |
| :align: center |
| |
| NOTE: This figure is a general illustration of how array packing works. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 243-250 |
| |
| We can use array packing to address the access pattern for B. Observe the array access pattern of |
| B after flattening which is not sequential as we iterate over the K dimension. We can reorder B |
| with dimensions [K][N] so that it has dimensions [N/bn][K][bn] where bn is the blocking factor and |
| also the vector size for B in the inner loop. This reorder splits N into two dimensions --- |
| bigN (N/bn) and littleN (bn) --- and the new dimensions [N/bn][K][bn] match the indexing of B |
| from outer to inner loops (no, ko, ki, ni) resulting in a sequential access pattern for B after |
| flattening. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 250-285 |
| |
| .. code-block:: default |
| |
| |
| |
| # We have to re-write the algorithm slightly. |
| packedB = te.compute( |
| (N / bn, K, bn), lambda bigN, k, littleN: B[k, bigN * bn + littleN], name="packedB" |
| ) |
| C = te.compute( |
| (M, N), |
| lambda m, n: te.sum(A[m, k] * packedB[n // bn, k, tvm.tir.indexmod(n, bn)], axis=k), |
| name="C", |
| ) |
| |
| s = te.create_schedule(C.op) |
| |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| (kaxis,) = s[C].op.reduce_axis |
| ko, ki = s[C].split(kaxis, factor=kfactor) |
| |
| s[C].reorder(mo, no, ko, mi, ki, ni) |
| s[C].vectorize(ni) |
| |
| bigN, _, littleN = s[packedB].op.axis |
| s[packedB].vectorize(littleN) |
| s[packedB].parallel(bigN) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| print("Opt4: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt4: 0.107549 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 286-287 |
| |
| Here is the generated IR after array packing. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 287-290 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for bigN in T.parallel(32): |
| for k in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[bigN * 1024 + k] = B_1[k * 1024 + bigN * 32:k * 1024 + bigN * 32 + 32] |
| for m_outer, n_outer in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| for m_inner_init in range(32): |
| C_1[m_outer * 32768 + m_inner_init * 1024 + n_outer * 32:m_outer * 32768 + m_inner_init * 1024 + n_outer * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, m_inner, k_inner in T.grid(256, 32, 4): |
| cse_var_3: T.int32 = m_outer * 32768 + m_inner * 1024 |
| cse_var_2: T.int32 = k_outer * 4 |
| cse_var_1: T.int32 = cse_var_3 + n_outer * 32 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_1[cse_var_1:cse_var_1 + 32] = C_1[cse_var_1:cse_var_1 + 32] + T.Broadcast(A_1[cse_var_3 + cse_var_2 + k_inner], 32) * packedB_1[n_outer * 1024 + cse_var_2 + k_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 291-297 |
| |
| Write cache for blocks |
| ---------------------- |
| After blocking, the program will write result to C block by block, the access pattern |
| is not sequential. So we can use a sequential cache array to hold the block results and |
| write to C when all the block results are ready. |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 297-336 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| |
| # Allocate write cache |
| CC = s.cache_write(C, "global") |
| |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| |
| # Write cache is computed at no |
| s[CC].compute_at(s[C], no) |
| |
| # New inner axes |
| mc, nc = s[CC].op.axis |
| |
| (kaxis,) = s[CC].op.reduce_axis |
| ko, ki = s[CC].split(kaxis, factor=kfactor) |
| s[CC].reorder(ko, mc, ki, nc) |
| s[CC].vectorize(nc) |
| |
| # TODO: Add separate optimization step to discuss loop unrolling |
| # unrolling is a loop optimization strategy which can reduce branch |
| # prediction failures and increases the chance of concurrent execution |
| # unroll kfactor loops |
| s[CC].unroll(ki) |
| |
| bigN, _, littleN = s[packedB].op.axis |
| s[packedB].vectorize(littleN) |
| s[packedB].parallel(bigN) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=10) |
| print("Opt5: %f" % evaluator(a, b, c).mean) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt5: 0.112403 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 337-338 |
| |
| Here is the generated IR after blocking. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 338-341 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| C_global = T.allocate([1024], "float32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for bigN in T.parallel(32): |
| for k in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[bigN * 1024 + k] = B_1[k * 1024 + bigN * 32:k * 1024 + bigN * 32 + 32] |
| for m_outer, n_outer in T.grid(32, 32): |
| C_global_1 = T.Buffer((1024,), data=C_global) |
| for m_c_init in range(32): |
| C_global_1[m_c_init * 32:m_c_init * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, m_c in T.grid(256, 32): |
| cse_var_4: T.int32 = k_outer * 4 |
| cse_var_3: T.int32 = m_c * 32 |
| cse_var_2: T.int32 = n_outer * 1024 + cse_var_4 |
| cse_var_1: T.int32 = m_outer * 32768 + m_c * 1024 + cse_var_4 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1], 32) * packedB_1[cse_var_2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 1], 32) * packedB_1[cse_var_2 + 1] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 2], 32) * packedB_1[cse_var_2 + 2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 3], 32) * packedB_1[cse_var_2 + 3] |
| for m_inner, n_inner in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[m_outer * 32768 + m_inner * 1024 + n_outer * 32 + n_inner] = C_global_1[m_inner * 32 + n_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 342-345 |
| |
| Parallel |
| -------- |
| Furthermore, we can also utilize multi-core processors to do the thread-level parallelization. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 345-380 |
| |
| .. code-block:: default |
| |
| |
| s = te.create_schedule(C.op) |
| |
| CC = s.cache_write(C, "global") |
| |
| mo, no, mi, ni = s[C].tile(C.op.axis[0], C.op.axis[1], bn, bn) |
| |
| s[CC].compute_at(s[C], no) |
| |
| mc, nc = s[CC].op.axis |
| |
| (kaxis,) = s[CC].op.reduce_axis |
| ko, ki = s[CC].split(kaxis, factor=kfactor) |
| s[CC].reorder(ko, mc, ki, nc) |
| s[CC].vectorize(nc) |
| s[CC].unroll(ki) |
| |
| # parallel |
| s[C].parallel(mo) |
| |
| bigN, _, littleN = s[packedB].op.axis |
| s[packedB].vectorize(littleN) |
| s[packedB].parallel(bigN) |
| |
| func = tvm.build(s, [A, B, C], target=target, name="mmult") |
| assert func |
| |
| c = tvm.nd.array(numpy.zeros((M, N), dtype=dtype), dev) |
| func(a, b, c) |
| tvm.testing.assert_allclose(c.numpy(), answer, rtol=1e-5) |
| |
| evaluator = func.time_evaluator(func.entry_name, dev, number=50) |
| opt6_time = evaluator(a, b, c).mean |
| print("Opt6: %f" % opt6_time) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Opt6: 0.135675 |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 381-382 |
| |
| Here is the generated IR after parallelization. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 382-385 |
| |
| .. code-block:: default |
| |
| |
| print(tvm.lower(s, [A, B, C], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.Buffer((1024, 1024), "float32"), B: T.Buffer((1024, 1024), "float32"), C: T.Buffer((1024, 1024), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| packedB = T.allocate([32768], "float32x32", "global") |
| packedB_1 = T.Buffer((32768,), "float32x32", data=packedB) |
| for bigN in T.parallel(32): |
| for k in range(1024): |
| B_1 = T.Buffer((1048576,), data=B.data) |
| packedB_1[bigN * 1024 + k] = B_1[k * 1024 + bigN * 32:k * 1024 + bigN * 32 + 32] |
| for m_outer in T.parallel(32): |
| C_global = T.allocate([1024], "float32", "global") |
| for n_outer in range(32): |
| C_global_1 = T.Buffer((1024,), data=C_global) |
| for m_c_init in range(32): |
| C_global_1[m_c_init * 32:m_c_init * 32 + 32] = T.Broadcast(T.float32(0), 32) |
| for k_outer, m_c in T.grid(256, 32): |
| cse_var_4: T.int32 = k_outer * 4 |
| cse_var_3: T.int32 = m_c * 32 |
| cse_var_2: T.int32 = n_outer * 1024 + cse_var_4 |
| cse_var_1: T.int32 = m_outer * 32768 + m_c * 1024 + cse_var_4 |
| A_1 = T.Buffer((1048576,), data=A.data) |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1], 32) * packedB_1[cse_var_2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 1], 32) * packedB_1[cse_var_2 + 1] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 2], 32) * packedB_1[cse_var_2 + 2] |
| C_global_1[cse_var_3:cse_var_3 + 32] = C_global_1[cse_var_3:cse_var_3 + 32] + T.Broadcast(A_1[cse_var_1 + 3], 32) * packedB_1[cse_var_2 + 3] |
| for m_inner, n_inner in T.grid(32, 32): |
| C_1 = T.Buffer((1048576,), data=C.data) |
| C_1[m_outer * 32768 + m_inner * 1024 + n_outer * 32 + n_inner] = C_global_1[m_inner * 32 + n_inner] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 388-395 |
| |
| Summary |
| ------- |
| After applying the above simple optimizations with only 18 lines of code, |
| our generated code can achieve 60% of the `numpy` performance with MKL. |
| Note that the outputs on the web page reflect the running times on a non-exclusive |
| Docker container, thereby they are *unreliable*. It is highly encouraged to run the |
| tutorial by yourself to observe the performance gain achieved by TVM. |
| |
| |
| .. _sphx_glr_download_how_to_optimize_operators_opt_gemm.py: |
| |
| .. only:: html |
| |
| .. container:: sphx-glr-footer sphx-glr-footer-example |
| |
| |
| .. container:: sphx-glr-download sphx-glr-download-python |
| |
| :download:`Download Python source code: opt_gemm.py <opt_gemm.py>` |
| |
| .. container:: sphx-glr-download sphx-glr-download-jupyter |
| |
| :download:`Download Jupyter notebook: opt_gemm.ipynb <opt_gemm.ipynb>` |
| |
| |
| .. only:: html |
| |
| .. rst-class:: sphx-glr-signature |
| |
| `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_ |