| |
| .. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY |
| .. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE |
| .. CHANGES, EDIT THE SOURCE PYTHON FILE: |
| .. "how_to/tune_with_autoscheduler/tune_network_cuda.py" |
| |
| .. only:: html |
| |
| .. note:: |
| :class: sphx-glr-download-link-note |
| |
| This tutorial can be used interactively with Google Colab! You can also click |
| :ref:`here <sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_cuda.py>` to run the Jupyter notebook locally. |
| |
| .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg |
| :align: center |
| :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/af264436d049e3cd84803b67b6620b63/tune_network_cuda.ipynb |
| :width: 300px |
| |
| .. rst-class:: sphx-glr-example-title |
| |
| .. _sphx_glr_how_to_tune_with_autoscheduler_tune_network_cuda.py: |
| |
| |
| Auto-scheduling a Neural Network for NVIDIA GPU |
| =============================================== |
| **Author**: `Lianmin Zheng <https://github.com/merrymercy>`_ |
| |
| Auto-tuning for specific devices and workloads is critical for getting the |
| best performance. This is a tutorial on how to tune a whole neural |
| network for NVIDIA GPU with the auto-scheduler. |
| |
| To auto-tune a neural network, we partition the network into small subgraphs and |
| tune them independently. Each subgraph is treated as one search task. |
| A task scheduler slices the tuning time and dynamically allocates it to |
| these tasks. The task scheduler predicts the impact of each task on the end-to-end |
| execution time and prioritizes the one that can reduce the execution time the most. |
| |
| For each subgraph, we use the compute declaration in :code:`tvm/python/topi` to |
| get the computational DAG in the tensor expression form. |
| We then use the auto-scheduler to construct a search space of this DAG and search |
| for good schedules (low-level optimizations). |
| |
| Different from the template-based :ref:`autotvm <tutorials-autotvm-sec>`, which relies on |
| manual templates to define the search space, the auto-scheduler does not require any |
| schedule templates. In other words, the auto-scheduler only uses the compute declarations |
| in :code:`tvm/python/topi` and does not use existing schedule templates. |
| |
| Note that this tutorial will not run on Windows or recent versions of macOS. To |
| get it to run, you will need to wrap the body of this tutorial in an :code:`if |
| __name__ == "__main__":` block. |
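
A minimal sketch of that wrapping (the function name ``main`` is just illustrative):

```python
# Sketch: put the tutorial body in a function and call it under the
# __main__ guard. Auto-scheduler's measurement uses multiprocessing,
# so on platforms that spawn subprocesses (Windows, recent macOS) the
# module must be safely re-importable without re-running the body.
def main():
    # ... tutorial body goes here: define network, extract tasks, tune ...
    return "done"

if __name__ == "__main__":
    status = main()
    print(status)  # done
```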
| |
| .. GENERATED FROM PYTHON SOURCE LINES 46-55 |
| |
| .. code-block:: default |
| |
| |
| import sys |
| import numpy as np |
| |
| import tvm |
| from tvm import relay, auto_scheduler |
| import tvm.relay.testing |
| from tvm.contrib import graph_executor |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 56-68 |
| |
| Define a Network |
| ---------------- |
| First, we need to define the network with the Relay frontend API. |
| We can load pre-defined networks from :code:`tvm.relay.testing`. |
| We can also load models from MXNet, ONNX, PyTorch, and TensorFlow |
| (see :ref:`front end tutorials<tutorial-frontend>`). |
| |
| For convolutional neural networks, although the auto-scheduler can work correctly |
| with any layout, we found that the best performance is typically achieved with the NHWC layout. |
| We have also implemented more optimizations for the NHWC layout in the auto-scheduler, |
| so it is recommended to convert your models to the NHWC layout before tuning. |
| You can use :ref:`ConvertLayout <convert-layout-usage>` pass to do the layout conversion in TVM. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 68-128 |
| |
| .. code-block:: default |
| |
| |
| |
| def get_network(name, batch_size, layout="NHWC", dtype="float32"): |
| """Get the symbol definition and random weight of a network""" |
| |
| # auto-scheduler prefers NHWC layout |
| if layout == "NHWC": |
| image_shape = (224, 224, 3) |
| elif layout == "NCHW": |
| image_shape = (3, 224, 224) |
| else: |
| raise ValueError("Invalid layout: " + layout) |
| |
| input_shape = (batch_size,) + image_shape |
| output_shape = (batch_size, 1000) |
| |
| if name.startswith("resnet-"): |
| n_layer = int(name.split("-")[1]) |
| mod, params = relay.testing.resnet.get_workload( |
| num_layers=n_layer, |
| batch_size=batch_size, |
| layout=layout, |
| dtype=dtype, |
| image_shape=image_shape, |
| ) |
| elif name.startswith("resnet3d-"): |
| n_layer = int(name.split("-")[1]) |
| mod, params = relay.testing.resnet_3d.get_workload( |
| num_layers=n_layer, |
| batch_size=batch_size, |
| layout=layout, |
| dtype=dtype, |
| image_shape=image_shape, |
| ) |
| elif name == "mobilenet": |
| mod, params = relay.testing.mobilenet.get_workload( |
| batch_size=batch_size, layout=layout, dtype=dtype, image_shape=image_shape |
| ) |
| elif name == "squeezenet_v1.1": |
| assert layout == "NCHW", "squeezenet_v1.1 only supports NCHW layout" |
| mod, params = relay.testing.squeezenet.get_workload( |
| version="1.1", |
| batch_size=batch_size, |
| dtype=dtype, |
| image_shape=image_shape, |
| ) |
| elif name == "inception_v3": |
| input_shape = (batch_size, 3, 299, 299) if layout == "NCHW" else (batch_size, 299, 299, 3) |
| mod, params = relay.testing.inception_v3.get_workload(batch_size=batch_size, dtype=dtype) |
| else: |
| raise ValueError("Unsupported network: " + name) |
| return mod, params, input_shape, output_shape |
| |
| |
| # Define the neural network and compilation target |
| network = "resnet-18" |
| batch_size = 1 |
| layout = "NHWC" |
| target = tvm.target.Target("cuda") |
| dtype = "float32" |
| log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target.kind.name) |
| |
| |
| |
| |
| |
| |
| |
| |
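
For reference, the naming scheme above yields one log file per (network, layout, batch size, target) combination; a quick stand-alone check of the resulting name (:code:`target.kind.name` is :code:`"cuda"` for the CUDA target):

```python
# Reproduce the log-file naming used above without needing TVM installed.
network, layout, batch_size = "resnet-18", "NHWC", 1
target_kind = "cuda"  # value of target.kind.name for tvm.target.Target("cuda")

log_file = "%s-%s-B%d-%s.json" % (network, layout, batch_size, target_kind)
print(log_file)  # resnet-18-NHWC-B1-cuda.json
```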
| .. GENERATED FROM PYTHON SOURCE LINES 129-138 |
| |
| Extract Search Tasks |
| -------------------- |
| Next, we extract the search tasks and their weights from a network. |
| The weight of a task is the number of appearances of the task's subgraph |
| in the whole network. |
| By using the weight, we can approximate the end-to-end latency of the network |
| as :code:`sum(latency[t] * weight[t])`, where :code:`latency[t]` is the |
| latency of a task and :code:`weight[t]` is the weight of the task. |
| The task scheduler optimizes this objective directly. |
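
A stand-alone illustration of this objective with made-up numbers (real latencies come from measurement, and real weights come from task extraction):

```python
# Approximate the end-to-end latency as sum(latency[t] * weight[t]).
# The values below are made-up placeholders for illustration.
latency = [0.05, 0.10, 0.02]  # best latency (ms) found so far for each task
weight = [8, 4, 1]            # number of times each subgraph appears

estimated_total = sum(l * w for l, w in zip(latency, weight))
print(round(estimated_total, 4))  # 0.82
```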
| |
| .. GENERATED FROM PYTHON SOURCE LINES 138-148 |
| |
| .. code-block:: default |
| |
| |
| # Extract tasks from the network |
| print("Extract tasks...") |
| mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype) |
| tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target) |
| |
| for idx, task in enumerate(tasks): |
| print("========== Task %d (workload key: %s) ==========" % (idx, task.workload_key)) |
| print(task.compute_dag) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Extract tasks... |
| ========== Task 0 (workload key: ["2d10de6646307f0e3e5cf4b31c20e69b", [1, 56, 56, 64], [1, 1, 64, 64], [1, 56, 56, 64]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3] |
| p1 = PLACEHOLDER [1, 1, 64, 64] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, (yy + ry), (xx + rx), rc]*p1[ry, rx, rc, ff]) |
| |
| ========== Task 1 (workload key: ["07f9fcad27bdd3233f86fe35a5185d33", [1, 56, 56, 64], [3, 3, 64, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| p1 = PLACEHOLDER [3, 3, 64, 128] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| p2 = PLACEHOLDER [1, 1, 1, 128] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 2 (workload key: ["08f7449d79e570b7274174709e5e5e01", [1, 512], [1000, 512], [1, 1000], [1, 1000]]) ========== |
| p0 = PLACEHOLDER [1, 512] |
| p1 = PLACEHOLDER [1000, 512] |
| T_matmul_NT(i0, i1) += (p0[i0, k]*p1[i1, k]) |
| p2 = PLACEHOLDER [1, 1000] |
| T_add(ax0, ax1) = (T_matmul_NT[ax0, ax1] + p2[ax0, ax1]) |
| |
| ========== Task 3 (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 14, 14, 256], [1, 1, 256, 512], [1, 7, 7, 512]]) ========== |
| p0 = PLACEHOLDER [1, 14, 14, 256] |
| pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3] |
| p1 = PLACEHOLDER [1, 1, 256, 512] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| |
| ========== Task 4 (workload key: ["d78e8eb6021c4cdda0ad7775d10f751a", [1, 7, 7, 512], [4, 4, 512, 512], [1, 7, 7, 512], [1, 7, 7, 512]]) ========== |
| p0 = PLACEHOLDER [1, 7, 7, 512] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 512, 512] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 7, 7, 512] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| |
| ========== Task 5 (workload key: ["8c53ca2904398da2889aa7508082d7bb", [1, 7, 7, 512], [1, 1, 1, 512]]) ========== |
| p0 = PLACEHOLDER [1, 7, 7, 512] |
| adaptive_pool_sum(ax0, ax1, ax2, ax3) += p0[ax0, ((ax1*7) + rv0), ((ax2*7) + rv1), ax3] |
| adaptive_pool_avg(ax0, ax1, ax2, ax3) = (adaptive_pool_sum[ax0, ax1, ax2, ax3]/(float32((select((bool)1, ((ax1 + 1)*7), (((ax1 + 1)*7) + 1)) - (ax1*7)))*float32((select((bool)1, ((ax2 + 1)*7), (((ax2 + 1)*7) + 1)) - (ax2*7))))) |
| |
| ========== Task 6 (workload key: ["25577781e50c611c2e45e73c1cb3a6ca", [1, 28, 28, 128], [4, 4, 128, 128], [1, 28, 28, 128], [1, 28, 28, 128]]) ========== |
| p0 = PLACEHOLDER [1, 28, 28, 128] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 128, 128] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 28, 28, 128] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| |
| ========== Task 7 (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 56, 56, 64], [1, 1, 64, 128], [1, 28, 28, 128]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3] |
| p1 = PLACEHOLDER [1, 1, 64, 128] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| |
| ========== Task 8 (workload key: ["7d79c516e212fe1d73f5dbb90eaca2cf", [1, 1000], [1, 1000]]) ========== |
| p0 = PLACEHOLDER [1, 1000] |
| T_softmax_maxelem(i0) max= p0[i0, k] |
| T_softmax_exp(i0, i1) = tir.exp((p0[i0, i1] - T_softmax_maxelem[i0])) |
| T_softmax_expsum(i0) += T_softmax_exp[i0, k] |
| T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0]) |
| |
| ========== Task 9 (workload key: ["6d012ba18a086c11ee2b85c7324e16f2", [1, 112, 112, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ========== |
| p0 = PLACEHOLDER [1, 112, 112, 64] |
| pad_temp(ax0, ax1, ax2, ax3) = tir.if_then_else(((((ax1 >= 1) && (ax1 < 113)) && (ax2 >= 1)) && (ax2 < 113)), p0[ax0, (ax1 - 1), (ax2 - 1), ax3], -3.40282e+38f) |
| pool_max(ax0, ax1, ax2, ax3) max= pad_temp[ax0, ((ax1*2) + rv0), ((ax2*2) + rv1), ax3] |
| p1 = PLACEHOLDER [1, 1, 1, 64] |
| T_add(ax0, ax1, ax2, ax3) = (pool_max[ax0, ax1, ax2, ax3] + p1[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 10 (workload key: ["f19692ed81d032b1697c08adee62f9a5", [1, 28, 28, 128], [4, 4, 128, 128], [1, 28, 28, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ========== |
| p0 = PLACEHOLDER [1, 28, 28, 128] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 128, 128] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 28, 28, 128] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| p3 = PLACEHOLDER [1, 1, 1, 128] |
| T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 11 (workload key: ["40b1cf1fd37b0ef111b3cc0247302508", [1, 7, 7, 512], [4, 4, 512, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ========== |
| p0 = PLACEHOLDER [1, 7, 7, 512] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 512, 512] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 1, 1, 512] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 12 (workload key: ["0bcf718c0e6566bcd6c3b1437a3b6291", [1, 28, 28, 128], [4, 4, 128, 128], [1, 1, 1, 128], [1, 28, 28, 128]]) ========== |
| p0 = PLACEHOLDER [1, 28, 28, 128] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 128, 128] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 1, 1, 128] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 13 (workload key: ["07f9fcad27bdd3233f86fe35a5185d33", [1, 14, 14, 256], [3, 3, 256, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ========== |
| p0 = PLACEHOLDER [1, 14, 14, 256] |
| pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| p1 = PLACEHOLDER [3, 3, 256, 512] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| p2 = PLACEHOLDER [1, 1, 1, 512] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 14 (workload key: ["64b7ce5264a64cb340d78b444b0325e6", [1, 14, 14, 256], [4, 4, 256, 256], [1, 14, 14, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ========== |
| p0 = PLACEHOLDER [1, 14, 14, 256] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 256, 256] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 14, 14, 256] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| p3 = PLACEHOLDER [1, 1, 1, 256] |
| T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 15 (workload key: ["6c4f6234946e16bcf9e48bdf289f9200", [1, 56, 56, 64], [6, 6, 64, 64], [1, 56, 56, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci] |
| B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), ..(OMITTED).. (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f)))))))))))))))))))))))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [6, 6, 64, 64] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), ..(OMITTED).. 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))))))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co] |
| p2 = PLACEHOLDER [1, 56, 56, 64] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| p3 = PLACEHOLDER [1, 1, 1, 64] |
| T_add(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3] + p3[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 16 (workload key: ["0fad1b42d0d33418e0a8d15d3bbad3c9", [1, 28, 28, 128], [1, 1, 128, 256], [1, 14, 14, 256]]) ========== |
| p0 = PLACEHOLDER [1, 28, 28, 128] |
| pad_temp(i0, i1, i2, i3) = p0[i0, i1, i2, i3] |
| p1 = PLACEHOLDER [1, 1, 128, 256] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| |
| ========== Task 17 (workload key: ["07f9fcad27bdd3233f86fe35a5185d33", [1, 224, 224, 3], [7, 7, 3, 64], [1, 1, 1, 64], [1, 112, 112, 64]]) ========== |
| p0 = PLACEHOLDER [1, 224, 224, 3] |
| pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 3) && (i1 < 227)) && (i2 >= 3)) && (i2 < 227)), p0[i0, (i1 - 3), (i2 - 3), i3], 0f) |
| p1 = PLACEHOLDER [7, 7, 3, 64] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| p2 = PLACEHOLDER [1, 1, 1, 64] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 18 (workload key: ["7f3fee61bc3c2604395f5d343b840b7c", [1, 14, 14, 256], [4, 4, 256, 256], [1, 14, 14, 256], [1, 14, 14, 256]]) ========== |
| p0 = PLACEHOLDER [1, 14, 14, 256] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 256, 256] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 14, 14, 256] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| |
| ========== Task 19 (workload key: ["10b8215aaf2e14d47d40b4093e6f41a0", [1, 56, 56, 64], [6, 6, 64, 64], [1, 56, 56, 64], [1, 56, 56, 64]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci] |
| B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), ..(OMITTED).. (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f)))))))))))))))))))))))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [6, 6, 64, 64] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), ..(OMITTED).. 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))))))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co] |
| p2 = PLACEHOLDER [1, 56, 56, 64] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| |
| ========== Task 20 (workload key: ["07f9fcad27bdd3233f86fe35a5185d33", [1, 28, 28, 128], [3, 3, 128, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ========== |
| p0 = PLACEHOLDER [1, 28, 28, 128] |
| pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| p1 = PLACEHOLDER [3, 3, 128, 256] |
| conv2d_nhwc(nn, yy, xx, ff) += (pad_temp[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*p1[ry, rx, rc, ff]) |
| p2 = PLACEHOLDER [1, 1, 1, 256] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_nhwc[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 21 (workload key: ["a3df19e5b88592ef5a9ce584a1ca3010", [1, 7, 7, 512], [4, 4, 512, 512], [1, 7, 7, 512], [1, 1, 1, 512], [1, 1, 1, 512], [1, 7, 7, 512]]) ========== |
| p0 = PLACEHOLDER [1, 7, 7, 512] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 8)) && (i2 >= 1)) && (i2 < 8)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 16), ((floormod(floordiv(p, 4), 4)*2) + eps), ((floormod(p, 4)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 512, 512] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*4)*4) + (floordiv(h, 2)*4)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 7, 7, 512] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, ax1, ax2, ax3]) |
| p3 = PLACEHOLDER [1, 1, 1, 512] |
| T_multiply(ax0, ax1, ax2, ax3) = (T_add[ax0, ax1, ax2, ax3]*p3[ax0, 0, 0, ax3]) |
| p4 = PLACEHOLDER [1, 1, 1, 512] |
| T_add(ax0, ax1, ax2, ax3) = (T_multiply[ax0, ax1, ax2, ax3] + p4[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 22 (workload key: ["7c2a4f1f432f81c44985590780dfb52d", [1, 56, 56, 64], [6, 6, 64, 64], [1, 1, 1, 64], [1, 56, 56, 64]]) ========== |
| p0 = PLACEHOLDER [1, 56, 56, 64] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 57)) && (i2 >= 1)) && (i2 < 57)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*4) + eps), ((floormod(p, 14)*4) + nu), ci] |
| B(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 6) == 5)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 6) == 4)), ..(OMITTED).. (floormod(j, 6) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 6) == 0)), 1f, 0f)))))))))))))))))))))))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [6, 6, 64, 64] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 6) == 5) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 6) == 5) && (floormod(j, 4) == 2)), ..(OMITTED).. 6) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 6) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))))))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 4), floormod(w, 4), ((((n*14)*14) + (floordiv(h, 4)*14)) + floordiv(w, 4)), co] |
| p2 = PLACEHOLDER [1, 1, 1, 64] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| ========== Task 23 (workload key: ["1097323f3970e5c881ad3a0028ca79cb", [1, 14, 14, 256], [4, 4, 256, 256], [1, 1, 1, 256], [1, 14, 14, 256]]) ========== |
| p0 = PLACEHOLDER [1, 14, 14, 256] |
| data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 15)) && (i2 >= 1)) && (i2 < 15)), p0[i0, (i1 - 1), (i2 - 1), i3], 0f) |
| input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 49), ((floormod(floordiv(p, 7), 7)*2) + eps), ((floormod(p, 7)*2) + nu), ci] |
| B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)), ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f)))))))))))))))) |
| data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu]) |
| p1 = PLACEHOLDER [4, 4, 256, 256] |
| bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*p1[eps, nu, co, ci]) |
| A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)), ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f)))))))) |
| inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw]) |
| conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*7)*7) + (floordiv(h, 2)*7)) + floordiv(w, 2)), co] |
| p2 = PLACEHOLDER [1, 1, 1, 256] |
| T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + p2[ax0, 0, 0, ax3]) |
| T_relu(ax0, ax1, ax2, ax3) = max(T_add[ax0, ax1, ax2, ax3], 0f) |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 149-171 |
| |
| Begin Tuning |
| ------------ |
Now we set some options for tuning and launch the search tasks:
| |
* :code:`measure_ctx` launches a separate process for measurement to
  provide isolation. It protects the main process from GPU crashes
  during measurement and avoids other runtime conflicts.
* :code:`min_repeat_ms` defines the minimum duration of one "repeat" in every measurement.
  This warms up the GPU, which is necessary for accurate measurement results.
  Typically, we recommend a value >= 300 ms.
| * :code:`num_measure_trials` is the number of measurement trials we can use during the tuning. |
| You can set it to a small number (e.g., 200) for a fast demonstrative run. |
| In practice, we recommend setting it around :code:`900 * len(tasks)`, |
| which is typically enough for the search to converge. |
| For example, there are 24 tasks in resnet-18, so we can set it as 20000. |
| You can adjust this parameter according to your time budget. |
* In addition, we use :code:`RecordToFile` to dump measurement records into a log file.
  The measurement records can be used to query the history best, resume the search,
  and perform more analyses later.
* See :any:`auto_scheduler.TuningOptions` and
  :any:`auto_scheduler.LocalRPCMeasureContext` for more parameters.
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 171-193 |
| |
| .. code-block:: default |
| |
| |
| |
| def run_tuning(): |
| print("Begin tuning...") |
| measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10) |
| |
| tuner = auto_scheduler.TaskScheduler(tasks, task_weights) |
| tune_option = auto_scheduler.TuningOptions( |
| num_measure_trials=200, # change this to 20000 to achieve the best performance |
| runner=measure_ctx.runner, |
| measure_callbacks=[auto_scheduler.RecordToFile(log_file)], |
| ) |
| |
| tuner.tune(tune_option) |
| |
| |
    # We do not run the tuning on our webpage server because it takes too long.
    # Uncomment the following line to run it yourself.
| |
| # run_tuning() |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 194-247 |
| |
| .. note:: Explain the printed information during tuning |
| |
  During tuning, a lot of information is printed to the console
  for debugging purposes. The most important part is the output
  of the task scheduler. The following table is a sample output.
| |
| .. code-block:: c |
| |
| ---------------------------------------------------------------------- |
| ------------------------------ [ Task Scheduler ] |
| ---------------------------------------------------------------------- |
| | ID | Latency (ms) | Speed (GFLOPS) | Trials | |
| ------------------------------------------------- |
| | 0 | 0.005 | 0.88 | 64 | |
| | 1 | 0.010 | 99.10 | 64 | |
| | 2 | 0.006 | 0.00 | 64 | |
| | 3 | 0.145 | 979.78 | 384 | |
| | 4 | 0.130 | 1097.02 | 384 | |
| | 5 | 0.143 | 992.69 | 384 | |
| | 6 | 0.076 | 1526.86 | 192 | |
| | 7 | 0.115 | 999.44 | 320 | |
| | 8 | 0.079 | 1449.39 | 320 | |
| | 9 | 0.122 | 938.73 | 384 | |
| | 10 | 0.063 | 1832.98 | 192 | |
| | 11 | 0.072 | 1763.62 | 256 | |
| | 12 | 0.062 | 2036.40 | 192 | |
| | 13 | 0.068 | 1874.44 | 192 | |
| | 14 | 0.049 | 2346.50 | 128 | |
| | 15 | 0.076 | 1694.31 | 256 | |
| | 16 | 0.067 | 1933.30 | 448 | |
| | 17 | 0.076 | 1680.90 | 256 | |
| | 18 | 0.022 | 98.43 | 64 | |
| | 19 | 0.076 | 3112.55 | 192 | |
| | 20 | 0.013 | 2026.44 | 64 | |
| | 21 | 0.011 | 1136.69 | 64 | |
| | 22 | 0.013 | 992.47 | 64 | |
| | 23 | 0.020 | 627.56 | 64 | |
| ------------------------------------------------- |
| Estimated total latency: 1.587 ms Trials: 4992 Used time : 13296 s Next ID: 3 |
| |
  This table lists the latency and (estimated) speed of all tasks,
  as well as the allocation of measurement trials across them.
  The last line prints the total weighted latency of these tasks,
  which gives a rough estimate of the end-to-end execution time
  of the network. It also prints the total number of measurement trials,
  the total time spent on auto-tuning, and the id of the next task to tune.
| |
  You will also see some "tvm::Error" and CUDA error messages, because the
  auto-scheduler tries some invalid schedules.
  You can safely ignore them as long as the tuning continues, because these
  errors are isolated from the main process.
| |
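The "Estimated total latency" in the table's last line is the weighted sum of
the per-task latencies, where a task's weight is the number of times its
subgraph appears in the network. A minimal sketch with hypothetical numbers
(not taken from the sample table above):

```python
# Hypothetical per-task latencies (ms) and task weights; the real values
# come from the task scheduler's measurements and the extracted task_weights.
latencies_ms = [0.005, 0.145, 0.076]
task_weights = [1, 2, 4]

# Estimated end-to-end latency = weighted sum of per-task latencies.
estimated_total_ms = sum(l * w for l, w in zip(latencies_ms, task_weights))
print(f"Estimated total latency: {estimated_total_ms:.3f} ms")
```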
| |
| .. GENERATED FROM PYTHON SOURCE LINES 249-255 |
| |
| .. note:: Terminate the tuning earlier |
| |
  You can terminate the tuning early by forcibly killing this process.
  As long as the log file contains at least one valid schedule for each task,
  you should be able to do the compilation (the section below).
| |
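Before stopping early, you can scan the log for distinct workload keys to see
which tasks already have at least one record. The sketch below is a
simplification that assumes each JSON-lines record exposes its workload key at
``record["i"][0][0]``; the log lines and key names here are hypothetical
stand-ins for a real log file, so check your own log's layout:

```python
import json

# Hypothetical log lines in a simplified auto_scheduler-like layout
# (assumption: the workload key sits at record["i"][0][0]).
log_lines = [
    '{"i": [["workload_a", "cuda"]], "r": [[0.001], 0, 1, 1]}',
    '{"i": [["workload_a", "cuda"]], "r": [[0.002], 0, 1, 1]}',
    '{"i": [["workload_b", "cuda"]], "r": [[0.003], 0, 1, 1]}',
]

tuned_keys = {json.loads(line)["i"][0][0] for line in log_lines}
all_task_keys = {"workload_a", "workload_b", "workload_c"}
missing = sorted(all_task_keys - tuned_keys)  # tasks still needing a schedule
print("tasks without any record:", missing)
```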
| |
| .. GENERATED FROM PYTHON SOURCE LINES 258-263 |
| |
| Compile and Evaluate |
| -------------------- |
| After auto-tuning, we can compile the network with the best schedules we found. |
| All measurement records are dumped into the log file during auto-tuning, |
| so we can read the log file and load the best schedules. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 263-281 |
| |
| .. code-block:: default |
| |
| |
| # Compile with the history best |
| print("Compile...") |
| with auto_scheduler.ApplyHistoryBest(log_file): |
| with tvm.transform.PassContext(opt_level=3, config={"relay.backend.use_auto_scheduler": True}): |
| lib = relay.build(mod, target=target, params=params) |
| |
| # Create graph executor |
| dev = tvm.device(str(target), 0) |
| module = graph_executor.GraphModule(lib["default"](dev)) |
| data_tvm = tvm.nd.array((np.random.uniform(size=input_shape)).astype(dtype)) |
| module.set_input("data", data_tvm) |
| |
| # Evaluate |
| print("Evaluate inference time cost...") |
| print(module.benchmark(dev, repeat=3, min_repeat_ms=500)) |
| |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| Compile... |
| Evaluate inference time cost... |
| Execution time summary: |
| mean (ms) median (ms) max (ms) min (ms) std (ms) |
| 3.2379 3.2379 3.2395 3.2363 0.0013 |
| |
| |
| |
| |
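The execution time summary above is plain descriptive statistics over the
per-repeat wall-clock times. A sketch with hypothetical timings (the real
numbers come from :code:`module.benchmark`):

```python
import statistics

# Hypothetical per-repeat times in ms, standing in for benchmark output.
times_ms = [3.2395, 3.2379, 3.2363]

print(f"mean   (ms): {statistics.mean(times_ms):.4f}")
print(f"median (ms): {statistics.median(times_ms):.4f}")
print(f"max    (ms): {max(times_ms):.4f}")
print(f"min    (ms): {min(times_ms):.4f}")
# Population standard deviation (NumPy's default, ddof=0).
print(f"std    (ms): {statistics.pstdev(times_ms):.4f}")
```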
| .. GENERATED FROM PYTHON SOURCE LINES 282-298 |
| |
| Other Tips |
| ---------- |
1. During tuning, the auto-scheduler needs to compile many programs and
   extract features from them. This part is CPU-intensive,
   so a high-performance CPU with many cores is recommended for a faster search.
2. You can use :code:`python3 -m tvm.auto_scheduler.measure_record --mode distill -i log.json`
   to distill the large log file and keep only the best records.
3. You can resume a search from a previous log file. Just add the
   argument :code:`load_log_file` when creating the task scheduler
   in function :code:`run_tuning`. For example,
   :code:`tuner = auto_scheduler.TaskScheduler(tasks, task_weights, load_log_file=log_file)`
4. If you have multiple target GPUs, you can use all of them to parallelize
   the measurements. Check this :ref:`section <tutorials-autotvm-scale-up-rpc-tracker>`
   to learn how to use the RPC Tracker and RPC Server.
   To use the RPC Tracker in auto-scheduler, replace the runner in :code:`TuningOptions`
   with :any:`auto_scheduler.RPCRunner`.
| |
| |
| .. rst-class:: sphx-glr-timing |
| |
**Total running time of the script:** (1 minute 16.838 seconds)
| |
| |
| .. _sphx_glr_download_how_to_tune_with_autoscheduler_tune_network_cuda.py: |
| |
| .. only:: html |
| |
| .. container:: sphx-glr-footer sphx-glr-footer-example |
| |
| |
| .. container:: sphx-glr-download sphx-glr-download-python |
| |
| :download:`Download Python source code: tune_network_cuda.py <tune_network_cuda.py>` |
| |
| .. container:: sphx-glr-download sphx-glr-download-jupyter |
| |
| :download:`Download Jupyter notebook: tune_network_cuda.ipynb <tune_network_cuda.ipynb>` |
| |
| |
| .. only:: html |
| |
| .. rst-class:: sphx-glr-signature |
| |
| `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_ |