| |
| .. DO NOT EDIT. THIS FILE WAS AUTOMATICALLY GENERATED BY |
| .. TVM'S MONKEY-PATCHED VERSION OF SPHINX-GALLERY. TO MAKE |
| .. CHANGES, EDIT THE SOURCE PYTHON FILE: |
| .. "tutorial/intro_topi.py" |
| |
| .. only:: html |
| |
| .. note:: |
| :class: sphx-glr-download-link-note |
| |
| This tutorial can be used interactively with Google Colab! You can also click |
| :ref:`here <sphx_glr_download_tutorial_intro_topi.py>` to run the Jupyter notebook locally. |
| |
| .. image:: https://raw.githubusercontent.com/tlc-pack/web-data/main/images/utilities/colab_button.svg |
| :align: center |
| :target: https://colab.research.google.com/github/apache/tvm-site/blob/asf-site/docs/_downloads/63f9e50204143ea3c2d3593c72439b3d/intro_topi.ipynb |
| :width: 300px |
| |
| .. rst-class:: sphx-glr-example-title |
| |
| .. _sphx_glr_tutorial_intro_topi.py: |
| |
| |
| .. _tutorial-topi: |
| |
| Introduction to TOPI |
| ==================== |
| **Author**: `Ehsan M. Kermani <https://github.com/ehsanmok>`_ |
| |
| This is an introductory tutorial to TVM Operator Inventory (TOPI). |
| TOPI provides numpy-style generic operations and schedules with higher abstractions than TVM. |
| In this tutorial, we will see how TOPI can save us from writing boilerplate code in TVM. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 28-35 |
| |
| .. code-block:: default |
| |
| |
| import tvm |
| import tvm.testing |
| from tvm import te |
| from tvm import topi |
| import numpy as np |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 39-45 |
| |
| Basic example |
| ------------- |
| Let's revisit the sum of rows operation (equivalent to :code:`B = numpy.sum(A, axis=1)`') \ |
| To compute the sum of rows of a two dimensional TVM tensor A, we should |
| specify the symbolic operation as well as schedule as follows |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 45-52 |
| |
| .. code-block:: default |
| |
| n = te.var("n") |
| m = te.var("m") |
| A = te.placeholder((n, m), name="A") |
| k = te.reduce_axis((0, m), "k") |
| B = te.compute((n,), lambda i: te.sum(A[i, k], axis=k), name="B") |
| s = te.create_schedule(B.op) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 53-55 |
| |
| and to examine the IR code in human readable format, we can do |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 55-57 |
| |
| .. code-block:: default |
| |
| print(tvm.lower(s, [A], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.handle): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| n, m = T.int32(), T.int32() |
| A_1 = T.match_buffer(A, (n, m), strides=("stride", "stride"), buffer_type="auto") |
| B = T.allocate([n], "float32", "global") |
| for i in range(n): |
| B_1 = T.Buffer((n,), data=B) |
| B_1[i] = T.float32(0) |
| for k in range(m): |
| A_2 = T.Buffer((A_1.strides[0] * n,), data=A_1.data, buffer_type="auto") |
| B_1[i] = B_1[i] + A_2[i * A_1.strides[0] + k * A_1.strides[1]] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 58-62 |
| |
| However, for such a common operation we had to define the reduce axis ourselves as well as explicit computation with |
| :code:`te.compute`. Imagine for more complicated operations how much details we need to provide. |
| Fortunately, we can replace those two lines with simple :code:`topi.sum` much like :code:`numpy.sum` |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 62-66 |
| |
| .. code-block:: default |
| |
| C = topi.sum(A, axis=1) |
| ts = te.create_schedule(C.op) |
| print(tvm.lower(ts, [A], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(A: T.handle): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| n, m = T.int32(), T.int32() |
| A_1 = T.match_buffer(A, (n, m), strides=("stride", "stride"), buffer_type="auto") |
| A_red = T.allocate([n], "float32", "global") |
| for ax0 in range(n): |
| A_red_1 = T.Buffer((n,), data=A_red) |
| A_red_1[ax0] = T.float32(0) |
| for k1 in range(m): |
| A_2 = T.Buffer((A_1.strides[0] * n,), data=A_1.data, buffer_type="auto") |
| A_red_1[ax0] = A_red_1[ax0] + A_2[ax0 * A_1.strides[0] + k1 * A_1.strides[1]] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 67-72 |
| |
| Numpy-style operator overloading |
| -------------------------------- |
| We can add two tensors using :code:`topi.broadcast_add` that have correct (broadcastable with specific) shapes. |
| Even shorter, TOPI provides operator overloading for such common operations. For example, |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 72-78 |
| |
| .. code-block:: default |
| |
| x, y = 100, 10 |
| a = te.placeholder((x, y, y), name="a") |
| b = te.placeholder((y, y), name="b") |
| c = a + b # same as topi.broadcast_add |
| d = a * b # same as topi.broadcast_mul |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 79-80 |
| |
| Overloaded with the same syntax, TOPI handles broadcasting a primitive (`int`, `float`) to a tensor :code:`d - 3.14`. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 82-90 |
| |
| Generic schedules and fusing operations |
| --------------------------------------- |
| Up to now, we have seen an example of how TOPI can save us from writing explicit computations in lower level API. |
| But it doesn't stop here. Still we did the scheduling as before. TOPI also provides higher level |
| scheduling recipes depending on a given context. For example, for CUDA, |
| we can schedule the following series of operations ending with :code:`topi.sum` using only |
| :code:`topi.generic.schedule_reduce` |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 90-97 |
| |
| .. code-block:: default |
| |
| e = topi.elemwise_sum([c, d]) |
| f = e / 2.0 |
| g = topi.sum(f) |
| with tvm.target.cuda(): |
| sg = topi.cuda.schedule_reduce(g) |
| print(tvm.lower(sg, [a, b], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| /workspace/python/tvm/target/target.py:446: UserWarning: Try specifying cuda arch by adding 'arch=sm_xx' to your target. |
| warnings.warn("Try specifying cuda arch by adding 'arch=sm_xx' to your target.") |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(a: T.Buffer((100, 10, 10), "float32"), b: T.Buffer((10, 10), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| T_divide_red = T.allocate([1], "float32", "global") |
| threadIdx_x = T.launch_thread("threadIdx.x", 1024) |
| T_divide_red_rf = T.allocate([1], "float32", "local") |
| reduce_temp0 = T.allocate([1], "float32", "local") |
| T_divide_red_rf_1 = T.Buffer((1,), data=T_divide_red_rf, scope="local", align=4) |
| T_divide_red_rf_1[0] = T.float32(0) |
| for k0_k1_fused_k2_fused_outer in range(10): |
| if T.likely(k0_k1_fused_k2_fused_outer * 64 + threadIdx_x // 16 < 625 and k0_k1_fused_k2_fused_outer * 64 + threadIdx_x // 16 < 625 and k0_k1_fused_k2_fused_outer * 64 + threadIdx_x // 16 < 625): |
| a_1 = T.Buffer((10000,), data=a.data) |
| b_1 = T.Buffer((100,), data=b.data) |
| T_divide_red_rf_1[0] = T_divide_red_rf_1[0] + (a_1[k0_k1_fused_k2_fused_outer * 1024 + threadIdx_x] + b_1[(k0_k1_fused_k2_fused_outer * 24 + threadIdx_x) % 100] + a_1[k0_k1_fused_k2_fused_outer * 1024 + threadIdx_x] * b_1[(k0_k1_fused_k2_fused_outer * 24 + threadIdx_x) % 100]) * T.float32(0.5) |
| reduce_temp0_1 = T.Buffer((1,), data=reduce_temp0, scope="local") |
| with T.attr(T.comm_reducer(lambda x, y: x + y, [T.float32(0)]), "reduce_scope", T.reinterpret("handle", T.uint64(0))): |
| T.tvm_thread_allreduce(T.uint32(1), T_divide_red_rf_1[0], T.bool(True), reduce_temp0_1[0], threadIdx_x) |
| if threadIdx_x == 0: |
| T_divide_red_1 = T.Buffer((1,), data=T_divide_red, align=4) |
| T_divide_red_1[0] = reduce_temp0_1[0] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 98-100 |
| |
| As you can see, scheduled stages of computation have been accumulated and we can examine them by |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 100-102 |
| |
| .. code-block:: default |
| |
| print(sg.stages) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| [stage(a, placeholder(a, 0x1da81280)), stage(b, placeholder(b, 0x11d5d230)), stage(T_add, compute(T_add, body=[a[ax0, ax1, ax2] + b[ax1, ax2]], axis=[T.iter_var(ax0, T.Range(0, 100), "DataPar", ""), T.iter_var(ax1, T.Range(0, 10), "DataPar", ""), T.iter_var(ax2, T.Range(0, 10), "DataPar", "")], reduce_axis=[], tag=broadcast, attrs={})), stage(T_multiply, compute(T_multiply, body=[a[ax0, ax1, ax2] * b[ax1, ax2]], axis=[T.iter_var(ax0, T.Range(0, 100), "DataPar", ""), T.iter_var(ax1, T.Range(0, 10), "DataPar", ""), T.iter_var(ax2, T.Range(0, 10), "DataPar", "")], reduce_axis=[], tag=broadcast, attrs={})), stage(T_elemwise_sum, compute(T_elemwise_sum, body=[T_add[ax0, ax1, ax2] + T_multiply[ax0, ax1, ax2]], axis=[T.iter_var(ax0, T.Range(0, 100), "DataPar", ""), T.iter_var(ax1, T.Range(0, 10), "DataPar", ""), T.iter_var(ax2, T.Range(0, 10), "DataPar", "")], reduce_axis=[], tag=elemwise, attrs={})), stage(T_divide, compute(T_divide, body=[T_elemwise_sum[ax0, ax1, ax2] / T.float32(2)], axis=[T.iter_var(ax0, T.Range(0, 100), "DataPar", ""), T.iter_var(ax1, T.Range(0, 10), "DataPar", ""), T.iter_var(ax2, T.Range(0, 10), "DataPar", "")], reduce_axis=[], tag=elemwise, attrs={})), stage(T_divide_red.rf, compute(T_divide_red.rf, body=[T.reduce(T.comm_reducer(lambda x, y: x + y, [T.float32(0)]), source=[T_divide[(k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024) // 10 // 10, (k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024) // 10 % 10, (k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024) % 10]], init=[], axis=[T.iter_var(k0_k1_fused_k2_fused_outer, T.Range(0, 10), "CommReduce", "")], condition=T.likely((k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024) // 10 // 10 < 100 and (k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024) // 10 < 1000 and k0_k1_fused_k2_fused_inner + k0_k1_fused_k2_fused_outer * 1024 < 10000), value_index=0)], axis=[T.iter_var(k0_k1_fused_k2_fused_inner, T.Range(0, 1024), "DataPar", "")], reduce_axis=[T.iter_var(k0_k1_fused_k2_fused_outer, T.Range(0, 10), "CommReduce", "")], tag=, attrs={})), stage(T_divide_red, compute(T_divide_red.repl, body=[T.reduce(T.comm_reducer(lambda x, y: x + y, [T.float32(0)]), source=[T_divide_red.rf[k0_k1_fused_k2_fused_inner_v]], init=[], axis=[T.iter_var(k0_k1_fused_k2_fused_inner_v, T.Range(0, 1024), "CommReduce", "")], condition=T.bool(True), value_index=0)], axis=[], reduce_axis=[T.iter_var(k0_k1_fused_k2_fused_inner_v, T.Range(0, 1024), "CommReduce", "")], tag=, attrs={}))] |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 103-105 |
| |
| We can test the correctness by comparing with :code:`numpy` result as follows |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 105-116 |
| |
| .. code-block:: default |
| |
| func = tvm.build(sg, [a, b, g], "cuda") |
| dev = tvm.cuda(0) |
| a_np = np.random.uniform(size=(x, y, y)).astype(a.dtype) |
| b_np = np.random.uniform(size=(y, y)).astype(b.dtype) |
| g_np = np.sum(np.add(a_np + b_np, a_np * b_np) / 2.0) |
| a_nd = tvm.nd.array(a_np, dev) |
| b_nd = tvm.nd.array(b_np, dev) |
| g_nd = tvm.nd.array(np.zeros(g_np.shape, dtype=g_np.dtype), dev) |
| func(a_nd, b_nd, g_nd) |
| tvm.testing.assert_allclose(g_nd.numpy(), g_np, rtol=1e-5) |
| |
| |
| |
| |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 117-119 |
| |
| TOPI also provides common neural nets operations such as _softmax_ with optimized schedule |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 119-125 |
| |
| .. code-block:: default |
| |
| tarray = te.placeholder((512, 512), name="tarray") |
| softmax_topi = topi.nn.softmax(tarray) |
| with tvm.target.Target("cuda"): |
| sst = topi.cuda.schedule_softmax(softmax_topi) |
| print(tvm.lower(sst, [tarray], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(tarray: T.Buffer((512, 512), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| T_softmax_norm = T.allocate([65536], "float32x4", "global") |
| blockIdx_x = T.launch_thread("blockIdx.x", 512) |
| normal_reduce_temp0 = T.allocate([1], "float32", "local") |
| reduce_temp0 = T.allocate([1], "float32", "local") |
| T_softmax_exp = T.allocate([512], "float32", "warp") |
| normal_reduce_temp0_1 = T.allocate([1], "float32", "local") |
| reduce_temp0_1 = T.allocate([1], "float32", "local") |
| threadIdx_x = T.env_thread("threadIdx.x") |
| T_softmax_exp_1 = T.Buffer((512,), data=T_softmax_exp, scope="warp") |
| with T.launch_thread(threadIdx_x, 32): |
| normal_reduce_temp0_2 = T.Buffer((1,), data=normal_reduce_temp0, scope="local") |
| normal_reduce_temp0_2[0] = T.float32(-3.4028234663852886e+38) |
| tarray_1 = T.Buffer((262144,), data=tarray.data) |
| for k_inner in range(16): |
| normal_reduce_temp0_2[0] = T.max(normal_reduce_temp0_2[0], tarray_1[blockIdx_x * 512 + threadIdx_x * 16 + k_inner]) |
| with T.attr(T.comm_reducer(lambda x, y: T.max(x, y), [T.float32(-3.4028234663852886e+38)]), "reduce_scope", T.reinterpret("handle", T.uint64(0))): |
| reduce_temp0_2 = T.Buffer((1,), data=reduce_temp0, scope="local") |
| T.tvm_thread_allreduce(T.uint32(1), normal_reduce_temp0_2[0], T.bool(True), reduce_temp0_2[0], threadIdx_x) |
| for i1_inner_outer in range(4): |
| cse_var_1: T.int32 = i1_inner_outer * 4 |
| reduce_temp0_2 = T.Buffer((1,), data=reduce_temp0, scope="local", align=4) |
| T_softmax_exp_1[threadIdx_x * 16 + cse_var_1:threadIdx_x * 16 + cse_var_1 + 4] = T.exp(tarray_1[blockIdx_x * 512 + threadIdx_x * 16 + cse_var_1:blockIdx_x * 512 + threadIdx_x * 16 + cse_var_1 + 4] - T.Broadcast(reduce_temp0_2[0], 4)) |
| T.launch_thread(threadIdx_x, 32) |
| normal_reduce_temp0_2 = T.Buffer((1,), data=normal_reduce_temp0_1, scope="local") |
| normal_reduce_temp0_2[0] = T.float32(0) |
| for k_inner in range(16): |
| normal_reduce_temp0_2[0] = normal_reduce_temp0_2[0] + T_softmax_exp_1[threadIdx_x * 16 + k_inner] |
| with T.attr(T.comm_reducer(lambda x, y: x + y, [T.float32(0)]), "reduce_scope", T.reinterpret("handle", T.uint64(0))): |
| reduce_temp0_2 = T.Buffer((1,), data=reduce_temp0_1, scope="local") |
| T.tvm_thread_allreduce(T.uint32(1), normal_reduce_temp0_2[0], T.bool(True), reduce_temp0_2[0], threadIdx_x) |
| for i1_inner_outer in range(4): |
| T_softmax_norm_1 = T.Buffer((65536,), "float32x4", data=T_softmax_norm) |
| reduce_temp0_2 = T.Buffer((1,), data=reduce_temp0_1, scope="local", align=4) |
| T_softmax_norm_1[blockIdx_x * 128 + threadIdx_x * 4 + i1_inner_outer] = T_softmax_exp_1[threadIdx_x * 16 + i1_inner_outer * 4:threadIdx_x * 16 + i1_inner_outer * 4 + 4] / T.Broadcast(reduce_temp0_2[0], 4) |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 126-137 |
| |
| Fusing convolutions |
| ------------------- |
| We can fuse :code:`topi.nn.conv2d` and :code:`topi.nn.relu` together. |
| |
| .. note:: |
| |
| TOPI functions are all generic functions. They have different implementations |
| for different backends to optimize for performance. |
| For each backend, it is necessary to call them under a target scope for both |
| compute declaration and schedule. TVM will choose the right function to call with |
| the target information. |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 137-147 |
| |
| .. code-block:: default |
| |
| |
| data = te.placeholder((1, 3, 224, 224)) |
| kernel = te.placeholder((10, 3, 5, 5)) |
| |
| with tvm.target.Target("cuda"): |
| conv = topi.cuda.conv2d_nchw(data, kernel, 1, 2, 1) |
| out = topi.nn.relu(conv) |
| sconv = topi.cuda.schedule_conv2d_nchw([out]) |
| print(tvm.lower(sconv, [data, kernel], simple_mode=True)) |
| |
| |
| |
| |
| |
| .. rst-class:: sphx-glr-script-out |
| |
| .. code-block:: none |
| |
| # from tvm.script import ir as I |
| # from tvm.script import tir as T |
| |
| @I.ir_module |
| class Module: |
| @T.prim_func |
| def main(placeholder: T.Buffer((1, 3, 224, 224), "float32"), placeholder_1: T.Buffer((10, 3, 5, 5), "float32")): |
| T.func_attr({"from_legacy_te_schedule": T.bool(True), "tir.noalias": T.bool(True)}) |
| compute = T.allocate([501760], "float32", "global") |
| blockIdx_z = T.launch_thread("blockIdx.z", 5) |
| conv2d_nchw = T.allocate([14], "float32", "local") |
| pad_temp_shared = T.allocate([112], "float32", "shared") |
| placeholder_shared = T.allocate([2], "float32", "shared") |
| blockIdx_y = T.launch_thread("blockIdx.y", 224) |
| blockIdx_x = T.launch_thread("blockIdx.x", 2) |
| threadIdx_z = T.launch_thread("threadIdx.z", 1) |
| threadIdx_y = T.launch_thread("threadIdx.y", 1) |
| threadIdx_x = T.launch_thread("threadIdx.x", 16) |
| conv2d_nchw_1 = T.Buffer((4,), data=conv2d_nchw, scope="local", align=8) |
| conv2d_nchw_1[0] = T.float32(0) |
| conv2d_nchw_1[2] = T.float32(0) |
| conv2d_nchw_1[4] = T.float32(0) |
| conv2d_nchw_1[6] = T.float32(0) |
| conv2d_nchw_1[8] = T.float32(0) |
| conv2d_nchw_1[10] = T.float32(0) |
| conv2d_nchw_1[12] = T.float32(0) |
| conv2d_nchw_1[1] = T.float32(0) |
| conv2d_nchw_1[3] = T.float32(0) |
| conv2d_nchw_1[5] = T.float32(0) |
| conv2d_nchw_1[7] = T.float32(0) |
| conv2d_nchw_1[9] = T.float32(0) |
| conv2d_nchw_1[11] = T.float32(0) |
| conv2d_nchw_1[13] = T.float32(0) |
| for rc_outer, ry_outer in T.grid(3, 5): |
| threadIdx_x_1 = T.env_thread("threadIdx.x") |
| pad_temp_shared_1 = T.Buffer((112,), data=pad_temp_shared, scope="shared") |
| placeholder_2 = T.Buffer((150528,), data=placeholder.data) |
| with T.launch_thread("threadIdx.z", 1) as threadIdx_z_1: |
| threadIdx_y_1 = T.launch_thread("threadIdx.y", 1) |
| T.launch_thread(threadIdx_x_1, 16) |
| pad_temp_shared_1[threadIdx_x_1 * 7] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and 1 <= blockIdx_x * 56 + threadIdx_x_1 * 7 // 2, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 450], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 1] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and 1 <= blockIdx_x * 56 + (threadIdx_x_1 * 7 + 1) // 2, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 449], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 2] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 448], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 3] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 447], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 4] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 446], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 5] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 445], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 6] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 444], T.float32(0)) |
| threadIdx_x_2 = T.env_thread("threadIdx.x") |
| placeholder_shared_1 = T.Buffer((2,), data=placeholder_shared, scope="shared", align=8) |
| placeholder_3 = T.Buffer((750,), data=placeholder_1.data) |
| with T.launch_thread("threadIdx.z", 1) as threadIdx_z_1: |
| threadIdx_y_1 = T.launch_thread("threadIdx.y", 1) |
| T.launch_thread(threadIdx_x_2, 16) |
| if T.likely(threadIdx_x_2 < 2): |
| placeholder_shared_1[threadIdx_x_2] = placeholder_3[blockIdx_z * 150 + threadIdx_x_2 * 75 + rc_outer * 25 + ry_outer * 5] |
| conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[0] |
| conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[0] |
| conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[0] |
| conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[0] |
| conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[0] |
| conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[0] |
| conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[0] |
| conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[1] |
| conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[1] |
| conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[1] |
| conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[1] |
| conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[1] |
| conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[1] |
| conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[1] |
| threadIdx_z_1 = T.env_thread("threadIdx.z") |
| threadIdx_y_1 = T.env_thread("threadIdx.y") |
| with T.launch_thread(threadIdx_z_1, 1): |
| T.launch_thread(threadIdx_y_1, 1) |
| T.launch_thread(threadIdx_x_1, 16) |
| pad_temp_shared_1[threadIdx_x_1 * 7] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and 1 <= blockIdx_x * 56 + (threadIdx_x_1 * 7 + 1) // 2, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 449], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 1] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 448], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 2] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 447], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 3] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 446], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 4] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 445], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 5] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 444], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 6] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 443], T.float32(0)) |
| threadIdx_z_2 = T.env_thread("threadIdx.z") |
| threadIdx_y_2 = T.env_thread("threadIdx.y") |
| with T.launch_thread(threadIdx_z_2, 1): |
| T.launch_thread(threadIdx_y_2, 1) |
| T.launch_thread(threadIdx_x_2, 16) |
| if T.likely(threadIdx_x_2 < 2): |
| placeholder_shared_1[threadIdx_x_2] = placeholder_3[blockIdx_z * 150 + threadIdx_x_2 * 75 + rc_outer * 25 + ry_outer * 5 + 1] |
| conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[0] |
| conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[0] |
| conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[0] |
| conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[0] |
| conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[0] |
| conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[0] |
| conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[0] |
| conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[1] |
| conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[1] |
| conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[1] |
| conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[1] |
| conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[1] |
| conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[1] |
| conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[1] |
| with T.launch_thread(threadIdx_z_1, 1): |
| T.launch_thread(threadIdx_y_1, 1) |
| T.launch_thread(threadIdx_x_1, 16) |
| pad_temp_shared_1[threadIdx_x_1 * 7] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 448], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 1] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 447], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 2] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 446], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 3] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 445], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 4] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 444], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 5] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 443], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 6] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 442], T.float32(0)) |
| with T.launch_thread(threadIdx_z_2, 1): |
| T.launch_thread(threadIdx_y_2, 1) |
| T.launch_thread(threadIdx_x_2, 16) |
| if T.likely(threadIdx_x_2 < 2): |
| placeholder_shared_1[threadIdx_x_2] = placeholder_3[blockIdx_z * 150 + threadIdx_x_2 * 75 + rc_outer * 25 + ry_outer * 5 + 2] |
| conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[0] |
| conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[0] |
| conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[0] |
| conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[0] |
| conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[0] |
| conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[0] |
| conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[0] |
| conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[1] |
| conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[1] |
| conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[1] |
| conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[1] |
| conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[1] |
| conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[1] |
| conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[1] |
| with T.launch_thread(threadIdx_z_1, 1): |
| T.launch_thread(threadIdx_y_1, 1) |
| T.launch_thread(threadIdx_x_1, 16) |
| pad_temp_shared_1[threadIdx_x_1 * 7] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 447], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 1] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 446], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 2] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 445], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 3] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 444], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 4] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 443], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 5] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 442], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 6] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and blockIdx_x * 16 + threadIdx_x_1 < 31, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 441], T.float32(0)) |
| with T.launch_thread(threadIdx_z_2, 1): |
| T.launch_thread(threadIdx_y_2, 1) |
| T.launch_thread(threadIdx_x_2, 16) |
| if T.likely(threadIdx_x_2 < 2): |
| placeholder_shared_1[threadIdx_x_2] = placeholder_3[blockIdx_z * 150 + threadIdx_x_2 * 75 + rc_outer * 25 + ry_outer * 5 + 3] |
| conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[0] |
| conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[0] |
| conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[0] |
| conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[0] |
| conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[0] |
| conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[0] |
| conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[0] |
| conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[1] |
| conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[1] |
| conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[1] |
| conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[1] |
| conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[1] |
| conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[1] |
| conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[1] |
| with T.launch_thread(threadIdx_z_1, 1): |
| T.launch_thread(threadIdx_y_1, 1) |
| T.launch_thread(threadIdx_x_1, 16) |
| pad_temp_shared_1[threadIdx_x_1 * 7] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 446], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 1] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 445], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 2] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 444], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 3] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 443], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 4] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 442], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 5] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and blockIdx_x * 16 + threadIdx_x_1 < 31, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 441], T.float32(0)) |
| pad_temp_shared_1[threadIdx_x_1 * 7 + 6] = T.if_then_else(2 <= blockIdx_y + ry_outer and blockIdx_y + ry_outer < 226 and blockIdx_x * 112 + threadIdx_x_1 * 7 < 216, placeholder_2[rc_outer * 50176 + blockIdx_y * 224 + ry_outer * 224 + blockIdx_x * 112 + threadIdx_x_1 * 7 - 440], T.float32(0)) |
| with T.launch_thread(threadIdx_z_2, 1): |
| T.launch_thread(threadIdx_y_2, 1) |
| T.launch_thread(threadIdx_x_2, 16) |
| if T.likely(threadIdx_x_2 < 2): |
| placeholder_shared_1[threadIdx_x_2] = placeholder_3[blockIdx_z * 150 + threadIdx_x_2 * 75 + rc_outer * 25 + ry_outer * 5 + 4] |
| conv2d_nchw_1[0] = conv2d_nchw_1[0] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[0] |
| conv2d_nchw_1[2] = conv2d_nchw_1[2] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[0] |
| conv2d_nchw_1[4] = conv2d_nchw_1[4] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[0] |
| conv2d_nchw_1[6] = conv2d_nchw_1[6] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[0] |
| conv2d_nchw_1[8] = conv2d_nchw_1[8] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[0] |
| conv2d_nchw_1[10] = conv2d_nchw_1[10] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[0] |
| conv2d_nchw_1[12] = conv2d_nchw_1[12] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[0] |
| conv2d_nchw_1[1] = conv2d_nchw_1[1] + pad_temp_shared_1[threadIdx_x] * placeholder_shared_1[1] |
| conv2d_nchw_1[3] = conv2d_nchw_1[3] + pad_temp_shared_1[threadIdx_x + 16] * placeholder_shared_1[1] |
| conv2d_nchw_1[5] = conv2d_nchw_1[5] + pad_temp_shared_1[threadIdx_x + 32] * placeholder_shared_1[1] |
| conv2d_nchw_1[7] = conv2d_nchw_1[7] + pad_temp_shared_1[threadIdx_x + 48] * placeholder_shared_1[1] |
| conv2d_nchw_1[9] = conv2d_nchw_1[9] + pad_temp_shared_1[threadIdx_x + 64] * placeholder_shared_1[1] |
| conv2d_nchw_1[11] = conv2d_nchw_1[11] + pad_temp_shared_1[threadIdx_x + 80] * placeholder_shared_1[1] |
| conv2d_nchw_1[13] = conv2d_nchw_1[13] + pad_temp_shared_1[threadIdx_x + 96] * placeholder_shared_1[1] |
| compute_1 = T.Buffer((501760,), data=compute) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x] = T.max(conv2d_nchw_1[0], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 16] = T.max(conv2d_nchw_1[2], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 32] = T.max(conv2d_nchw_1[4], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 48] = T.max(conv2d_nchw_1[6], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 64] = T.max(conv2d_nchw_1[8], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 80] = T.max(conv2d_nchw_1[10], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 96] = T.max(conv2d_nchw_1[12], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50176] = T.max(conv2d_nchw_1[1], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50192] = T.max(conv2d_nchw_1[3], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50208] = T.max(conv2d_nchw_1[5], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50224] = T.max(conv2d_nchw_1[7], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50240] = T.max(conv2d_nchw_1[9], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50256] = T.max(conv2d_nchw_1[11], T.float32(0)) |
| compute_1[blockIdx_z * 100352 + blockIdx_y * 224 + blockIdx_x * 112 + threadIdx_x + 50272] = T.max(conv2d_nchw_1[13], T.float32(0)) |
| |
| |
| |
| |
| .. GENERATED FROM PYTHON SOURCE LINES 148-154 |
| |
| Summary |
| ------- |
| In this tutorial, we have seen |
| |
| - How to use TOPI API for common operations with numpy-style operators. |
| - How TOPI facilitates generic schedules and operator fusion for a context, to generate optimized kernel codes. |
| |
| |
| .. _sphx_glr_download_tutorial_intro_topi.py: |
| |
| .. only:: html |
| |
| .. container:: sphx-glr-footer sphx-glr-footer-example |
| |
| |
| .. container:: sphx-glr-download sphx-glr-download-python |
| |
| :download:`Download Python source code: intro_topi.py <intro_topi.py>` |
| |
| .. container:: sphx-glr-download sphx-glr-download-jupyter |
| |
| :download:`Download Jupyter notebook: intro_topi.ipynb <intro_topi.ipynb>` |
| |
| |
| .. only:: html |
| |
| .. rst-class:: sphx-glr-signature |
| |
| `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_ |