- Feature Name: Buffer Physical Layout
- Authors: Eric Lunderberg (@Lunderberg), Wuwei Lin (@vinx13)
- Start Date: 2021-10-05
- RFC PR: [apache/tvm-rfcs#0039](https://github.com/apache/tvm-rfcs/pull/0039)
- GitHub Issue: Not Yet Written

# Summary
[summary]: #summary

This RFC introduces layout transformations that can be applied to a
buffer during the lowering process. These transformations will be
part of the schedule, allowing the same compute definition to be used
across multiple different layouts. These transformations can produce
either flat memory buffers or multi-dimensional memory buffers to be
exposed to the low-level code generators.

# Motivation
[motivation]: #motivation

Currently, TVM assumes that all buffers can be treated as flat memory.
That is, while a rank-N tensor requires N values to describe its shape
and N indices to identify a particular value within it, the underlying
buffer allocated by the low-level codegen has a single value defining
the size, and access into that buffer is done using a single index.
This assumption holds for most cases, such as a CPU accessing RAM,
but doesn't hold in all cases. For example, texture memory on a GPU
requires two indices to access. These are currently handled on a
case-by-case basis, such as using `tvm::tir::builtin::texture2d_store`
in a `CallNode`.

In addition, computations that are semantically identical (e.g. 2-d
convolution) require independent compute definitions and schedules
(e.g. `conv2d_nchw` and `conv2d_hwcn`) based on the format of the data
accepted as input.

This RFC introduces a mechanism to specify transformations to be
applied to the layout of buffers in memory, along with a unified
method of presenting multiple indices to the low-level code
generators. This will allow for target-specific handling of non-flat
memory, and will allow for code re-use across compute definitions that
differ only in memory layout.

# Guide-level explanation
[guide-level-explanation]: #guide-level-explanation

A buffer is represented by a `tvm::tir::Buffer` object, and has some
shape associated with it. This shape is initially defined from the
buffer's shape in the compute definition. Buffers can either be
allocated within a `tvm::tir::PrimFunc` using a `tvm::tir::Allocate`
node, or can be passed in as parameters to a `PrimFunc`. Buffer
access is done using `tvm::tir::BufferLoad` and
`tvm::tir::BufferStore` for reads and writes, respectively.
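
The relationship between these objects can be sketched in the same
informal pseudo-code used by the examples later in this RFC; the
snippet below is illustrative only, not exact TIR syntax.

```python
# Illustrative pseudo-code; names follow the TIR nodes described above.
x = Buffer(name="x", shape=[64, 128])   # tvm::tir::Buffer, with an associated shape
with Allocate(x):                       # tvm::tir::Allocate within a PrimFunc
    val = BufferLoad(x, [10, 15])       # read through tvm::tir::BufferLoad
    BufferStore(x, val, [10, 15])       # write through tvm::tir::BufferStore
```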

When a TIR graph is passed into the low-level code generator
`tvm::codegen::Build`, the rank of each buffer must be supported by
the target code generator. Typically, this will mean generating a
single index representing access into flat memory. Some code
generators may attach alternative semantics for `rank>1`
buffers (e.g. rank-2 buffers to represent texture memory on OpenCL).
A low-level code generator should check the rank of the buffers it is
acting on, and give a diagnostic error for unsupported rank.
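
A minimal sketch of such a rank check is shown below, written as
Python pseudo-code; actual code generators are implemented in C++, and
the error type used here is only illustrative.

```python
# Hedged sketch: reject buffers whose rank this target cannot lower.
def validate_buffer_rank(buffer, supported_ranks=(1,)):
    rank = len(buffer.shape)
    if rank not in supported_ranks:
        # A real code generator would report this through TVM's diagnostic
        # machinery rather than a plain Python exception.
        raise ValueError(
            f"Buffer '{buffer.name}' has rank {rank}, but this code generator "
            f"only supports ranks {list(supported_ranks)}"
        )
```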

To define the layout transformation in a TE schedule, use the
`transform_layout` method of a schedule, as shown below. The
argument to `transform_layout` is a function that accepts a list of
`tvm.tir.Var` representing a logical index, and returns a list of
`tvm.tir.PrimExpr` giving the corresponding physical index. If
`transform_layout` isn't called, then no additional layout
transformations are applied.

For example, the schedule below defines a reordering from an NHWC
logical layout to an NCHWc physical layout.

```python
# Compute definition, written in terms of NHWC logical axes
B = te.compute(A.shape, lambda n,h,w,c: A[n,h,w,c])
s = te.create_schedule(B.op)

def nhwc_to_nchwc(n, h, w, c):
    return [n, c//4, h, w, c%4]

transformed_nchwc_axes = s[B].transform_layout(nhwc_to_nchwc)

# Compute definition that would produce an equivalent physical layout
B_equivalent = te.compute(
    [A.shape[0], A.shape[3]//4, A.shape[1], A.shape[2], 4],
    lambda n, c_outer, h, w, c_inner: A[n, h, w, 4*c_outer+c_inner],
)
```

By default, after all explicitly specified layout transformations are
applied, all axes are flattened to a single axis by following a
row-major traversal. This produces a 1-d buffer, which corresponds to
flat memory. To produce `rank>1` buffers in the physical layout,
insert `te.AXIS_SEPARATOR` into the axis list returned by the physical
layout function. These define groups of axes, where each group is
combined into a single physical axis.

```python
B = te.compute(shape=(M,N,P,Q), ...)
s = te.create_schedule(B.op)

# Default, produces a 1-d allocation with shape (M*N*P*Q,)
s[B].transform_layout(lambda m,n,p,q: [m,n,p,q])

# One separator, produces a 2-d allocation with shape (M*N, P*Q).
s[B].transform_layout(lambda m,n,p,q: [m, n, te.AXIS_SEPARATOR, p, q])

# Two separators, produces a 3-d allocation with shape (M, N*P, Q).
s[B].transform_layout(lambda m,n,p,q: [m, te.AXIS_SEPARATOR, n, p, te.AXIS_SEPARATOR, q])

# Can be used along with reorders and splits.
s[B].transform_layout(lambda m,n,p,q: [m, q//4, n, te.AXIS_SEPARATOR, p, q%4])
```


The `te.AXIS_SEPARATOR` object exists only within the API interface,
and is not part of the representation of the layout transformation
within the generated TIR graph. Instead, the TIR graph will contain
an integer list of axis separators, to be used when flattening buffers
to device-supported rank in the `StorageFlatten` or `FlattenBuffer`
passes.
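
A plain-Python sketch of this conversion is shown below; it is not the
actual TVM implementation, and the integer convention follows the
examples in the reference-level section, where a stored value `k`
closes a group of axes after transformed axis `k`.

```python
# Hedged sketch of how te.AXIS_SEPARATOR markers in the API could become the
# integer list of axis separators stored in the TIR graph.
def extract_axis_separators(axes_with_markers, AXIS_SEPARATOR):
    axes, separators = [], []
    for item in axes_with_markers:
        if item is AXIS_SEPARATOR:
            separators.append(len(axes) - 1)  # current group ends after the previous axis
        else:
            axes.append(item)
    return axes, separators

# [n, c//4, h, te.AXIS_SEPARATOR, w, c%4] would become
# axes=[n, c//4, h, w, c%4] with axis_separators=[2].
```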

If the tensor whose layout is being transformed is the result of
`te.compute`, then the loop iteration order over that tensor will be
rewritten to be along the updated memory layout. If the loop
iteration order is modified, these new loop iteration variables will
be returned from `transform_layout()`.

```python
A = te.placeholder(shape=[16,64,128])
B = te.compute(A.shape, lambda i,j,k: 2*A[i,j,k])

s = te.create_schedule(B.op)

# A is an input placeholder, and doesn't have nested loops that
# generate it. Therefore, while the layout of A is rewritten along
# with any reads/writes into A, there are no loop iterators to be
# rewritten and no loop iterators are returned.
s[A].transform_layout(lambda i,j,k: [i*64 + j, k//4, k%4])

# B is a computed tensor, and is computed inside a sequence of nested
# loops. Therefore, when B's layout is rewritten, those nested loops
# are also rewritten, and the corresponding loop iterators are
# returned.
i_outer, jk_merged, i_inner = s[B].transform_layout(lambda i,j,k: [i//4, 128*j + k, i%4])

# The loop iterators returned by transform_layout() can be used later
# in the schedule, if the iteration order should be different from the
# layout order of the output tensor.
s[B].reorder(i_outer, i_inner, jk_merged)
```

# Reference-level explanation
[reference-level-explanation]: #reference-level-explanation

For schedules written in either TE or TIR, the axis separators are stored
in `BufferNode::axis_separators`. For TIR-based schedules, the
re-indexing of a buffer is performed on demand. For TE-based schedules,
the mapping used to re-index a buffer is stored in the
`"layout_transform_map"` attribute of the `PrimFunc`, and is applied as
part of lowering. This attribute is a map whose keys are the buffer
vars to be reshaped, and whose values are the transformations to be
applied.

Many of the utilities needed for this transformation already exist in
`iter_affine_map.h`, and are used in the implementation. For TIR-based
schedules, the transformation primitive is applied immediately.

A buffer may be allocated with `AllocateNode`, and may be interacted
with using `BufferLoadNode` and `BufferStoreNode`.
`BufferRealizeNode` should only appear in TE-based schedules, and
should be converted to `AllocateNode`. `LoadNode` and `StoreNode`
are deprecated.

## Impacted TIR Nodes

- BufferNode
  - Describes an N-d buffer. This may directly represent a tensor (N-d
    buffer produced by TE), a flat memory array (1-d buffer as input
    to the low-level codegen), or intermediates between them.

- BufferRealizeNode
  - Realization of a buffer, in logical layout.
  - For external buffers, serves as an optional annotation. For
    internal buffers, results in allocation of memory.


- BufferLoadNode/BufferStoreNode
  - Read/write of a buffer.

  - Change from previous behavior: Will exist throughout the lowering
    process, and will be passed to the low-level code generators.
    Transformations that previously created `Load` and `Store` nodes
    will instead create `BufferLoad` and `BufferStore` nodes with 1-d
    indices.



- AllocateNode
  - Declares the allocation of a buffer, in physical layout.

  - Change from previous behavior: Previously, `AllocateNode` held the
    `buffer_var`, datatype, and buffer extents directly. After
    implementation of this RFC, `AllocateNode` will instead hold the
    `Buffer` that is to be allocated (a before/after sketch follows
    this list).


- LoadNode/StoreNode
  - Read/write of a 1-d buffer, given a `Var` pointer to the start of
    the buffer and a single index.

  - Deprecated, should instead use `BufferLoad` and `BufferStore` with
    a 1-d index.
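
The pseudo-code below (in the style of the Examples section) sketches
the `AllocateNode` change described above; the exact field names are
illustrative rather than the final API.

```python
# Before this RFC: AllocateNode directly holds the buffer var, dtype, and extents.
Allocate(buffer_var=x, dtype="float32", extents=[64*128])

# After this RFC: AllocateNode holds the Buffer being allocated, which carries
# its own shape, dtype, and axis_separators.
x_buf = Buffer(name="x", shape=[64, 128], dtype="float32")
Allocate(x_buf)
```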


## Impacted TIR Transformations

- `ApplyBufferTransforms`
  - A new pass that takes as input a TIR graph that may have buffer
    transformations stored in the `PrimFunc` attributes. Returns
    a TIR graph with all buffer transforms applied as specified.

  - Rewrite `indices` in BufferStore/BufferLoad nodes based on the
    specified transformation.

  - The transformations are stored as a `Map<Var, Array<IndexMap>>` in
    the `"layout_transform_map"` attribute of a `PrimFunc`.
    All buffers whose `BufferNode::data` is a key in this map should
    have their physical layout rewritten. If the array contains
    multiple transformations, they are applied sequentially (a sketch
    of this sequential application follows the `IndexMapNode`
    definition below).

    A possible structure for the `IndexMap` node is shown
    below.

    ```c++
    class IndexMapNode : public Object {
     public:
      /*! \brief Variables representing the indices prior to remapping.
       *
       * If initial_index is empty, then final_index should also be
       * empty, and no mapping is applied.
       */
      Array<Var> initial_index;

      /*!
       * \brief Expressions defining the indices after remapping.
       *
       * These expressions should only be in terms of the initial_index,
       * and must be expressible as a `tvm::arith::IterSumExpr`. The
       * mapping from `initial_index` to `final_index` must be injective.
       *
       * If final_index is empty, then initial_index should also be
       * empty, and the map is an identity function.
       */
      Array<PrimExpr> final_index;
    };
    ```
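
    As an illustration of the sequential application mentioned above,
    the plain-Python sketch below (not the actual TVM implementation;
    the second map is hypothetical) applies an `Array<IndexMap>` to a
    set of indices, each map consuming the output of the previous one.

    ```python
    # Hedged illustration: apply an Array<IndexMap> to buffer indices in order.
    nhwc_to_nchwc = lambda n, h, w, c: [n, c // 4, h, w, c % 4]
    swap_outer = lambda n, co, h, w, ci: [co, n, h, w, ci]  # hypothetical second map

    def apply_sequentially(indices, index_maps):
        for index_map in index_maps:
            indices = index_map(*indices)
        return indices

    assert apply_sequentially([11, 37, 23, 101],
                              [nhwc_to_nchwc, swap_outer]) == [25, 11, 37, 23, 1]
    ```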

  - After applying the transformations, the
    `"layout_transform_map"` attribute should be removed.
    This ensures that additional application of
    `ApplyBufferTransforms` has no effect.

- FlattenBuffer/StorageFlatten

  - Existing passes that convert from logical layout to physical
    layout for TE schedules (StorageFlatten) or TensorIR schedules
    (FlattenBuffer).

  - The transformations are stored in the `Buffer` object as the
    `BufferNode::axis_separators`. All buffers that share the same
    `BufferNode::data` should be flattened to an
    output buffer of rank `axis_separators.size()+1`. All other
    buffers should be flattened to a 1-d output buffer. (A sketch of
    this flattening rule is shown below.)

  - After flattening a buffer to an N-d output, the corresponding
    value in the `axis_separators` should be set to `range(N-1)`.
    This ensures that repeated application of the flattening passes
    has no additional effect. (The list shouldn't be deleted
    entirely, as that would cause a flattened rank-`N` buffer and an
    unflattened rank-`N` buffer to have identical representations.)
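
  A minimal plain-Python sketch of this flattening rule is shown
  below, assuming the axis-separator convention used in the examples
  of this section (a stored value `k` ends a group after axis `k`); it
  is illustrative, not the actual pass.

  ```python
  # Hedged sketch: merge each group of axes row-major, producing a shape of
  # rank len(axis_separators) + 1.
  def flatten_shape(shape, axis_separators):
      flattened, start = [], 0
      for sep in list(axis_separators) + [len(shape) - 1]:
          extent = 1
          for dim in shape[start:sep + 1]:
              extent *= dim
          flattened.append(extent)
          start = sep + 1
      return flattened

  assert flatten_shape([64, 128], []) == [64 * 128]
  assert flatten_shape([16, 128 // 4, 64, 64, 4], [2]) == [16 * (128 // 4) * 64, 64 * 4]
  ```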


## Examples

The following are intended as pseudo-code, and exclude details not
relevant to this RFC (e.g. dtype). These do not correspond with the
final version of TensorIR that implements this RFC. Numeric values
are shown unsimplified to indicate where they come from.

The first example shows a 2-d buffer with no layout transformations
explicitly specified. The generated `PrimFunc` has no
`"layout_transform_map"` attribute, and so the default
behavior is used, applying a row-major traversal to generate a flat
1-d buffer.

```python
# In TE schedule, no call to transform_layout.

# Initial TIR graph
x = Buffer(name="x", shape=[64,128])
with Allocate(x):
    val = BufferLoad(x, [10, 15])
    BufferStore(x, 7, [20, 23])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[64*128]):
    val = BufferLoad(x, [10*128 + 15])
    BufferStore(x, 7, [20*128 + 23])
```

This next example shows a 2-d logical buffer, which is lowered to a
1-d physical buffer. `transform_layout` has been used to define a
physical layout whose fastest changing dimension corresponds to the
first index in the logical layout.

```python
# In TE schedule
# s[x].transform_layout(lambda i,j: [j,i])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda i,j: [j,i]
x = Buffer(name="x", shape=[64,128])
with Allocate(x):
    val = BufferLoad(x, [10, 15])
    BufferStore(x, 7, [20, 23])

# After applying the explicit reordering
x = Buffer(name="x", shape=[128,64])
with Allocate(x):
    val = BufferLoad(x, [15, 10])
    BufferStore(x, 7, [23, 20])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[128*64]):
    val = BufferLoad(x, [15*64 + 10])
    BufferStore(x, 7, [23*64 + 20])
```

The next example shows a remapping from NHWC logical layout to NCHWc
physical layout. The 4 logical axes are expanded to 5 logical axes
during the `ApplyBufferTransforms` pass, then flattened into 1 physical
axis during StorageFlatten/FlattenBuffer.

```python
# In TE schedule
# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, w, c%4])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
x = Buffer(name="x", shape=[16,64,64,128], axis_separators=[])
with Allocate(x):
    val = BufferLoad(x, [11, 37, 23, 101])

# After applying the explicit reordering
x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4], axis_separators=[])
with Allocate(x):
    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])

# After flattening to 1-d
x = Var(name="x")
with Allocate(x, shape=[16 * (128/4) * 64 * 64 * 4]):
    val = BufferLoad(x, index=[(128/4)*64*64*4*11 + 64*64*4*floor(101/4) + 64*4*37 + 4*23 + 101%4])
```

Lastly, an example of remapping from `NHWC` logical layout to `NCHWc`
physical layout, packed into a 2-d physical layout with `NCH` in the
first physical axis and `Wc` in the second physical axis. This is the
packing used by the current `"global.texture"` definition for texture
memory. The change applied during `ApplyBufferTransforms` is
identical to the previous example, but StorageFlatten produces a 2-d
physical index. The interpretation of this 2-d index depends on the
target-specific codegen.

```python
# In TE schedule
# s[x].transform_layout(lambda n,h,w,c: [n, c//4, h, te.AXIS_SEPARATOR, w, c%4])

# Initial TIR graph
attrs["layout_transform_map"][x] = lambda n,h,w,c: [n, c//4, h, w, c%4]
x = Buffer(name="x", shape=[16,64,64,128], axis_separators=[2])
with Allocate(x):
    val = BufferLoad(x, [11, 37, 23, 101])

# After applying the explicit reordering.
x = Buffer(name="x", shape=[16, 128/4, 64, 64, 4], axis_separators=[2])
with Allocate(x):
    val = BufferLoad(x, index=[11, floor(101/4), 37, 23, 101%4])

# After applying StorageFlatten or FlattenBuffer. The final result is
# 2-d, due to the te.AXIS_SEPARATOR used in the `.transform_layout`.
# The `axis_separators` are set to [0], to distinguish this 2-d flattened
# buffer from a 2-d unflattened buffer.

x = Buffer(name="x", shape=[16 * (128/4) * 64, 64*4], axis_separators=[0])
with Allocate(x, shape=[16 * (128/4) * 64, 64 * 4]):
    val = BufferLoad(x, index=[(128/4)*64*11 + 64*floor(101/4) + 37, 4*23 + 101%4])
```



# Drawbacks
[drawbacks]: #drawbacks

This change may make it more difficult to reason about the memory
layout when writing the `te.compute` definition. When the physical
layout differs from the logical layout, it isn't guaranteed that
`A[i]` and `A[i+1]` will be adjacent. For example, for a tensor whose
compute definition is written in `NHWC` layout and whose layout
transformation to `NCHWc` is defined by `[n, c//4, h, w, c%4]`,
locations `(0,0,0,3)` and `(0,0,0,4)` in the compute definition will
not be adjacent in memory.
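
To make the example concrete, the sketch below evaluates the mapping
at those two locations; it is a plain-Python illustration of the
transformation already defined above.

```python
# Hedged illustration of the adjacency example above.
nhwc_to_nchwc = lambda n, h, w, c: [n, c // 4, h, w, c % 4]

assert nhwc_to_nchwc(0, 0, 0, 3) == [0, 0, 0, 0, 3]
assert nhwc_to_nchwc(0, 0, 0, 4) == [0, 1, 0, 0, 0]
# After the default row-major flattening, these two elements are separated by
# roughly H*W*4 elements rather than by 1.
```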


# Rationale and alternatives
[rationale-and-alternatives]: #rationale-and-alternatives

- Can these design goals be met with existing features?

  The `te.compute` function can be used to define an updated layout.
  However, this introduces a new tensor that must be inlined to avoid
  additional memory allocation, and cannot be used for input
  parameters.

  This design applies equally to tensors defined as a result of a
  computation and to input tensors. In both cases, the
  `transform_layout` causes all reads/writes to that buffer to obey
  the specified layout. In the case of input tensors, it states that
  the tensors passed in will be in the specified format.


- Should buffer transformations be a node within a TIR graph, or an
  attribute?

  Option 1 is preferred.

  - Option 1: The transformations are stored in attributes of
    `PrimFunc`.

    This makes it clear that the transformations apply to all uses of
    the buffer within the graph, and are not scoped to some region of
    the TIR graph.

  - Option 2: The transformations are stored in a node that inherits
    from `tir::Stmt`.

    This would be easier for other passes to visit using
    `StmtVisitor`, if the layout transformations require modification.
    However, it would add confusion if a `Stmt` impacts buffers far
    outside its own scope.



- When should the `tir::transform::ApplyBufferTransforms` pass be
  applied?

  Applying it at the end of phase-2 in `driver_api.cc::CreatePassList`
  satisfies both of the following conditions.

  - To ensure that host and device have the same definition for buffer
    layout, it should occur before the host/device split in
    `MakePackedAPI`.

  - Since other transformations can make use of buffer
    transformations, it should otherwise be as late as possible in the
    lowering flow. (e.g. `InjectDoubleBuffer` mapping to a new buffer
    shape)



- Should buffer transformations re-use functionality of other nodes?

  Option 1 is preferred.

  - Option 1: Add buffer transformations as an attribute to the
    `PrimFunc`.

  - Option 2: In TE-based schedules, `AttrStmtNode` could give the
    buffer to be transformed, along with the transformation to be
    applied, similar to how `buffer_bind_scope` is currently handled.

    However, the `BufferTransform` must contain multiple objects that
    are not derived from `PrimExpr` (the buffer to be transformed and
    the mapping to be applied), while `AttrStmtNode` only allows a
    single `ObjectRef` node and a `PrimExpr` value.

  - Option 3: In TensorIR-based schedules, `MatchBufferRegion` could
    be extended to also include a transformation while performing the
    buffer replacement.

    However, this could make it more difficult to reason about which
    locations in the buffer region are being accessed.

  - Option 4: The `BufferNode` object could contain an array of
    transformations that should be applied to it during the lowering
    process. This would be convenient and allow for arbitrarily many
    transformations.

    However, this wouldn't follow the TVM convention of having
    annotations external to the node itself.


- Where should transformations to be applied to the function inputs be
  specified?

  Option 1 is preferred.

  - Option 1: Any `BufferTransform` that describes a buffer in the
    `PrimFuncNode::buffer_map` gets applied to that buffer.

    Would require two traversals, the first to locate all buffer
    transforms, and the second to apply them.

  - Option 2: `BufferTransform` nodes listed in the `PrimFunc::attrs`
    under a `"buffer_argument_transforms"` key apply to the function
    arguments.

    Would only need a single traversal to apply.

    Would require other passes to be aware of where a buffer was first
    defined, in order to add it to the appropriate location.


- What arguments should the function passed to `transform_layout` accept?

  In these examples, the tensor is rank `N` prior to the
  transformation.

  Option 3 is preferred.

  - Option 1: Accept a list of length `N`. Each element of the list
    is a variable corresponding to a coordinate in the input tensor.

    This would be the simplest Python implementation, but would
    require additional configuration to have named variables in the
    mapping.

  - Option 2: Accept `N` named positional arguments (`func(i,j,k)`),
    where each argument is a variable corresponding to a coordinate in
    the input tensor.

    This follows the usual method of defining the `fcompute` function
    passed to `te.compute`. This also allows the named variables to
    be used as the names in TIR, improving readability.

    However, this wouldn't allow utility functions that define
    transformations that apply to an arbitrary number of indices, such
    as a layout transformation that changes the last index, while
    leaving the other `N-1` indices untouched.

  - Option 3: Accept either `N` named positional arguments
    (`func(i,j,k)`), or a variable number of arguments
    (`func(*indices)`).

    This follows the same convention as the `fcompute` function passed
    to `te.compute`. This would allow either an explicit listing of
    all indices as named arguments, or an arbitrary number of indices.
    A sketch of such a reusable mapping is shown below.
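
  For example, Option 3 would allow a rank-agnostic helper such as the
  hedged sketch below; the helper name and the usage shown in the
  comment are hypothetical.

  ```python
  # Hedged example: split only the last axis, regardless of the tensor's rank.
  def pack_last_axis(*indices, factor=4):
      *outer, last = indices
      return [*outer, last // factor, last % factor]

  # Hypothetical usage: s[B].transform_layout(pack_last_axis)
  assert pack_last_axis(1, 2, 3, 10) == [1, 2, 3, 2, 2]
  ```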


- What convention should be used for buffer indexing?

  Previously, the interpretation of an index into a buffer depended on
  whether the buffer was being accessed with
  `BufferStore`/`BufferLoad` (pre-flattening) or with `Store`/`Load`
  (post-flattening). Since the same data structures will be used at
  all lowering stages, the indexing should have consistent semantics.

  Option 1 is preferred.

  - Option 1: When accessing a buffer, the type and offset are based on
    `buffer->dtype`.

    The offset of an element is given by `index *
    sizeof(buffer->dtype)`. The type of the element being accessed is
    `buffer->dtype.with_lanes(index.lanes() * buffer->dtype.lanes())`.

    This is the convention used by user-defined schedules in TE, and
    in BufferLoad/BufferStore objects. In this convention, scalar
    loads and vectorized loads can be expressed for scalar buffers and
    vectorized buffers. Accessing a buffer to return a different
    datatype requires declaring an aliasing buffer that shares the
    same backing array.

    ```python
    @T.prim_func
    def scalar_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[0].dtype == "float32x4"


    @T.prim_func
    def vector2_load_from_vector4_buffer(A: T.Buffer[(16,), "float32x4"]):
        A_vector_2 = T.buffer_decl([32], "float32x2", data=A.data)
        assert A[0].dtype == "float32x4"
        assert A_vector_2[0].dtype == "float32x2"


    @T.prim_func
    def vector_load_from_scalar_buffer_option1(A: T.Buffer[(64,), "float32"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"


    @T.prim_func
    def vector_load_from_scalar_buffer_option2(A: T.Buffer[(64,), "float32"]):
        A_vector = T.buffer_decl([16], "float32x4", data=A.data)
        assert A_vector[0].dtype == "float32x4"


    @T.prim_func
    def scalar_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        A_scalar = T.buffer_decl([64], "float32", data=A.data)
        assert A_scalar[0].dtype == "float32"
    ```

    - Pro: The return type of `buf[0]` is always `buf.dtype`, even
      when `buf.dtype` is a vectorized type.

    - Pro: No changes needed on the user-defined schedules.

    - Con: Requires updates to code generators to follow this new
      convention. However, the code generators will already require
      updates to support BufferLoad/BufferStore.

  - Option 2: When accessing a buffer, the type and offset are based on
    `buffer->dtype.element_of()`.

    The offset of an element is given by `index *
    sizeof(buffer->dtype.element_of())`. The type of the element
    being accessed is `buffer->dtype.with_lanes(index.lanes())`.

    Prior to this RFC, this is the convention used by Load/Store
    nodes. In this convention, scalar loads and vectorized loads can
    be expressed for scalar buffers and vectorized buffers. Accessing
    a buffer to return a vectorized datatype requires using a
    vectorized index, even if the buffer holds a vectorized datatype.

    ```python
    @T.prim_func
    def scalar_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"


    @T.prim_func
    def scalar_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]):
        assert A[0].dtype == "float32"


    @T.prim_func
    def vector_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]):
        assert A[T.ramp(0, 1, 4)].dtype == "float32x4"
    ```

    - Pro: The number of lanes of output can be determined solely from
      the index used to access the buffer. That is, `A[0]` is
      guaranteed to have one lane of output, and `A[Ramp(0, stride=1,
      lanes=4)]` is guaranteed to have four lanes of output.

    - Con: Access of a buffer with scalar index does not always have
      the same datatype as the buffer. If the buffer has a vectorized
      datatype, then `buf[0].dtype != buf.dtype`.

    - Con: Need explicit check for vectorized types at the codegen
      level.

    - Con: Requires updates to user-defined schedules.


# Prior art
[prior-art]: #prior-art

- CuDNN has an [explicit enumeration of allowed input
  formats](https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnTensorFormat_t),
  which are specific to image formatting.

- The reorder/split/flatten sequence is equivalent in numpy to using
  `np.reshape` to split the logical axes, then `np.transpose` to
  reorder them, then `np.reshape` to merge multiple axes into the N-d
  physical axes. A numpy sketch of the NHWC to NCHWc example is shown
  below.
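
  For concreteness, the hedged numpy sketch below reproduces the NHWC
  to NCHWc example from the guide-level section using exactly that
  reshape/transpose/reshape sequence; the sizes are arbitrary.

  ```python
  import numpy as np

  N, H, W, C = 1, 8, 8, 16
  a_nhwc = np.arange(N * H * W * C).reshape(N, H, W, C)

  split = a_nhwc.reshape(N, H, W, C // 4, 4)  # np.reshape to split the c axis
  reordered = split.transpose(0, 3, 1, 2, 4)  # np.transpose to order as n, c//4, h, w, c%4
  flat = reordered.reshape(-1)                # np.reshape to merge into a flat 1-d buffer

  # Each logical element (n, h, w, c) lands at the transformed location.
  n, h, w, c = 0, 3, 5, 10
  assert flat.reshape(N, C // 4, H, W, 4)[n, c // 4, h, w, c % 4] == a_nhwc[n, h, w, c]
  ```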

# Unresolved questions
[unresolved-questions]: #unresolved-questions

- Should the `te.AXIS_SEPARATOR` appear in the TIR graph?

  Option 1 is preferred.

  - Option 1: The `te.AXIS_SEPARATOR` is a TE-specific concept, and
    does not appear in the generated TIR graph. Instead, it changes
    the `BufferTransform` node that represents the flattening of
    buffers to a device-supported number of indices.

    This would be a unified way to represent all layout
    transformations in the TIR graph, which may or may not change the
    rank of the buffer. The flattening of buffers to a
    device-supported rank would be handled identically to any other
    layout transformation, rather than having an implicit row-major
    traversal.

  - Option 2: The `te.AXIS_SEPARATOR` is represented in the TIR graph,
    and alters the behavior of the `StorageFlatten` pass. There is no
    `BufferTransform` node that represents the flattening of buffers.

    In a TIR graph without any other modifications, this would
    maintain the current behavior of the `StorageFlatten` pass, which
    reduces the N-d buffer to a 1-d buffer by a row-major traversal.
    In a TIR graph with some additional annotation to represent the
    `M` axis separators, the N-d buffer could instead be reduced to an
    `M+1`-d buffer.

- What is appropriate terminology for size/shape/extent of physical
  and logical buffers?

  If Allocate/BufferStore/BufferLoad each hold a reference to the
  buffer they act upon, then this becomes a somewhat irrelevant
  question, as there is only one `BufferNode::shape`.

  - I am partial to using "shape" for both logical and physical N-d
    parameters, and have attempted to use it consistently throughout
    this RFC.
  - "size" implies a 1-d buffer, which wouldn't be appropriate for
    an N-d parameter.
  - "extent" would be a reasonable name, but is currently used by
    `tvm::RangeNode` to indicate a range of values that may start at
    a non-zero value. Since the indices for logical and physical
    buffers both start at zero, using "extents" for the maximum
    index would imply some offset.



- How should loops over an array be handled when re-writing the shape?

  To avoid memory latency issues, loops should iterate over an array
  sequentially when possible. Iteration that is defined in terms of
  the logical layout may be inappropriate for the physical layout.

  Option 3 is preferred.

  - Option 1: Do nothing, and always keep the same iteration order,
    using the same iteration axes as defined in the compute
    definition.

    This would produce valid code, but not necessarily performant
    code. This can be a default behavior during development, to be
    improved upon.

  - Option 2: Automatically detect loops that are over the full extent
    of an array in sequential order of the logical layout, and rewrite
    to be in sequential order of the physical layout.

    This would reduce the memory latency issues, but raises some
    implementation questions.

    - If a loop body references multiple tensors with different
      physical layouts, which should define the loop iteration order?

    - If a series of nested loops contains a `cache_read` or
      `cache_write` stage, can these be recognized and reordered?

  - Option 3: Expose the transformed axes to be used as part of a
    schedule definition. In TE, the return value from `AA =
    s[A].transform_layout(...)` would be a tensor, and the transformed
    axes `AA.op.axis` can then be used for the remainder of the
    schedule.

    This would allow the greatest flexibility, but would make the
    schedule dependent on the transformed layout, beyond the one
    definition.

# Future possibilities
[future-possibilities]: #future-possibilities

- Could be used to simplify many of the `topi` schedules for image
  processing.
- Could introduce variation of physical layout during `cache_read` and
  `cache_write` steps, as a potential source of optimization.