In https://github.com/apache/tvm/pull/9727 and RFC#39, we deprecated Load
and Store
to use BufferLoad
and BufferStore
instead in order to support generalized multi-dimensional physical buffer access. Here we document necessary clarifications, implications about the new buffer convention, as well as the post-hoc pass checklist.
The main goal of this RFC is to summarize the existing buffer convention and the IR changes in https://github.com/apache/tvm/pull/9727 which have a broader impact. There are no new semantics proposed in this RFC.
What’s a buffer?
Buffer is a compile-time representation of contiguous block of memory. Since a Buffer is typically used as backing storage for a TensorType
, it includes relevant information from that TensorType
which can be sufficiently generalized to an array, such as data type and shape information. A Buffer needs to be declared and allocated before it can be used.
Declaration of buffer
Buffer can be declared in the following ways:
buffer_map
of PrimFunc
. TIR's type system does not accommodate rich array types, instead representing them as T.handle
(typically emitted as void*
). The buffer_map
specifies how to interpret such T.handle
when using it as a basis for array accesses.T.alloc_buffer
is used S-TIR
to create and allocate a buffer.T.buffer_decl
can be used to create a buffer alias by specifying the underlying data variable to reuse the data from another buffer. It can also be used to reinterpret the data type of the buffer. T.buffer_decl
can also be used to create a buffer alias with a different elem_offset
. elem_offset
should be handled during the lowering process.Examples of T.buffer_decl
is shown below.
@T.prim_func def buffer_alias(A: T.Buffer[(16,), "float"]): A_vector = T.buffer_decl([4], "float32x4", data=A.data) @T.prim_func def buffer_alloc(): A = T.buffer_decl([4, 4], "float32") Allocate(A.data, [16], "float32")
In the future, we will consider renaming T.buffer_decl
to T.decl_buffer
to make it name a verb phase that is consistent with the existing ones like T.alloc_buffer
, T.match_buffer
.
Allocation of buffer
In low-level TIR, tir::Allocate
is used to allocate a data variable with given shapes. tir::Allocate
returns a data variable of type T.handle
(since TIR's type system does not accommodate rich arrays), which may be reinterpreted with a different shape or data type using T.buffer_decl
.
Explicit DeclBuffer
IR construct
T.buffer_decl
doesn't correspond to a TIR node. Instead, T.buffer_decl
returns either:
The current behavior of TVMScriptPrinter
is to implicitly print a T.buffer_decl
at the beginning of PrimFunc
for any undefined buffers. The implicit behavior can be error-prone. In light of the migration, we should consider an explicit DeclBuffer
as part of the IR. This will be further discussed in a separate RFC.
Buffer Aliasing
T.buffer_decl
creates a buffer alias if the underlying data variable (.data
field) overlaps with another buffer. Buffer created via T.alloc_buffer
always do not alias. Buffer aliases do not need Allocate
to create the data variable -- they may simply reuse the data variable from the Buffer being aliased. If a transformation would produce multiple allocations of the same buffer var (e.g. unrolling a loop that contains an allocation), the transform should update the allocations to be unique using tvm::tir::ConvertSSA
.
Buffers should not alias each other unless necessary, because aliased buffers increase complexity for TIR transformations. Passes that rewrite buffers should clearly indicate how aliased buffers are handled. For example, when changing the underlying layout of stored elements in a buffer, all buffer aliases must also be updated. Currently, we don't have analysis for buffer aliasing. This is a future developement task if buffer aliasing is used broadly. Therefore, while buffer aliasing is typically free at runtime, this imposes a cost for buffer aliasing both to compile times and development complexity.
Discussion: When it is safe to transform a buffer
We would like to discuss some examples of when it is safe to transform a buffer w.r.t. aliasing rules:
(1) is fine under aliasing as long as the low level memory is shared. This is because buffer alias here is used to reinterpret a buffer, which only changes the way we access the buffer. As long as there are no other buffer transformations or analysis applied to this buffer, it is safe to use the alias.
On the other hand, any transformations or analysis applied on a buffer should be clear how to handle buffer aliases correctly. (2) and (3) are such examples, they would need more cares. (2) requires all the aliases be changed together. (3) requires to compute the compact buffer shape and then rewrite the buffer shape. This need us to take all alias into consideration and then rewrite their shapes together.
Generalizing buffer accesses
Previously we used Load
and Store
to represent low-level buffer accesses. Load
and Store
consist of data variable, data type and index, which can be directly translated to pointer cast and accesses in runtime. Note that data type given to Load
/ Store
can be different from the Buffer's data variable type. For example,
A = T.buffer_decl(shape=(16,), dtype='float') T.load("float4", A.data, T.ramp(4, 1, 4))
can be translated to
*((float4*)(A + 4))
in C codegen.
However, BufferLoad
and BufferStore
themselves can not reinterpret a buffer to a different shape or data type. They always return the data type specified on underlying buffer object. This is the fundamental difference between Load/Store
and BufferLoad/BufferStore
that we need to deal with carefully.
Vectorized access is achieved by using Ramp
as index in Load/Store
. Vectorized buffer access via BufferLoad
/BufferStore
can be achieved either by using a scalar index to access a buffer that has a vectorized type, or by using Ramp
as an index into a buffer that has a scalar type. For N-D buffer indices, it is possible that Ramp
being used in multiple dimensions (e.g. A[Ramp(...), ..., Ramp(...)]
). In this case the number of lanes of the data type of such value is the product of each Ramp
. We limit Ramp
to only the last dimension as multiple Ramp
creates additional complexity.
Different combinations of buffer type and index type (scalar vs. vector) are clarified in
RFC#39, excerpts are the following:
@T.prim_func def scalar_load_from_scalar_buffer(A: T.Buffer[(64,), "float32"]): assert A[0].dtype == "float32" @T.prim_func def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]): assert A[0].dtype == "float32x4" @T.prim_func def vector_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]): A_vector_2 = T.buffer_decl([32], "float32x2", data=A.data) assert A[0].dtype == "float32x4" assert A_vector_2[0].dtype == "float32x2" @T.prim_func def vector_load_from_scalar_buffer_option1(A: T.Buffer[(64,), "float32"]): assert A[T.ramp(0, 1, 4)].dtype == "float32x4" @T.prim_func def vector_load_from_scalar_buffer_option2(A: T.Buffer[(64,), "float32"]): A_vector = T.buffer_decl([16], "float32x4", data=A.data) assert A_vector[0].dtype == "float32x4" @T.prim_func def scalar_load_from_vector_buffer(A: T.Buffer[(16,), "float32x4"]): A_scalar = T.buffer_decl([64], "float32", data=A.data) assert A_scalar[0].dtype == "float32" #multiple dimensional buffer accesses @T.prim_func def nd_scalar_load_from_scalar_buffer(A: T.Buffer[(64, 64,), "float32"]): assert A[0, 0].dtype == "float32" @T.prim_func def nd_vector_load_from_scalar_buffer(A: T.Buffer[(64,64), "float32"]): assert A[0, T.ramp(0, 1, 4)].dtype == "float32x4"
In rare cases, vector index can be used to access a vector buffer. We leave this usage as undefined until we have a clear use case.
VectorBufferRewrite
In some backend like SPIR-V where runtime pointer casts are not available, even between types that differ only in the number of lanes (e.g. float16
and float16x4.
), VectorTypeRewriter
will be used to rewrite the buffer to a vector type. (VectorBufferRewrite rewrites the buffer from vector_load_from_scalar_buffer
into scalar_load_from_vector_buffer
in the above example).
Removing pre-flattened buffer
Buffer information before flattening are necessary during compilation. They specify the calling convention of PrimFunc
and are translated to assertions of buffer shapes, strides, etc. in runtime. preflattened_buffer_map
was introduced in https://github.com/apache/tvm/pull/9727 to save these information after buffer flattening.
During the lowering process, although buffer accesses inside PrimFunc
are flattened to match physical buffer dimensions, the calling convention of the PrimFunc
are kept unchanged - It still expect the parameter to have multi-dimensional logical buffer shape. Therefore, we would like to unify preflattened_buffer_map
and buffer_map
. buffer_map
should be kept unchanged during buffer flattening. Instead, we declare an aliasing buffer as the flattened buffer after flattening. For example, after flattening, the TIR will look like
def fn(X: T.Buffer([2, 3], "float32"): X_flattened = T.buffer_decl(X.data, [6], "float32") for i in grid(6): X_flattened[i] = ....
Here are a list of TIR passes that can be impacted significantly when migrating from Load/Store
to BufferLoad/BufferStore
.
StorageFlatten
/ FlattenBuffer
: These passes flatten buffer to physical dimensions. As discussed above, they should create flattened buffer via T.buffer_decl
while keeping buffer_map
unchanged (see the discussion in Removing pre-flattened buffer section). Any subsequent passes that rewrite buffer, such as, InjectDoubleBuffer
, InjectVirtualThread
, should operate on physical buffers and should not changing the number of buffer dimensions. Allocate
after flattening will reflect physical buffer dimensions. Alternatively, these passes could be made simpler by moving them to occur before the buffer is flattened. For example, implementing InjectDoubleBuffer
by changing the shape to [2, *old_shape]
, and accessing using [i%2, *old_indices]
. That would limit the size/stride handling to occur only during buffer flattening.Ramp
for vectorized accesses, should consider limiting vector index as the last dimension.StorageRewrite
: This pass should be extended to handle N-D physical buffer.VectorTypeRewriter
should also consider limiting vector index as the last dimension.MakePackedAPI
: This pass adds additional parameters (variables) to PrimFunc
according to the FFI calling convention. These variables can no longer be used in Load
directly. Buffer should be declared and then BufferLoad
should be used to access values of these parameters.LowerThreadAllreduce
: This pass is involved with a few buffer rewriting. Need to check buffer declarations / accesses follow the new convention here.T.buffer_decl
creates buffer alias, it is important to consider implications and use T.buffer_decl
properly. Passes that transform buffers should consider how to buffer alias. Therefore we should be able to have a unified method called T.buffer_decl
in both TIR and TVMScript.T.buffer_decl, T.match_buffer, T.alloc_buffer
.BufferLoad/BufferStore
can be generalized to allow Ramp
as part of the index.T.buffer_decl
is going to be used to declare flattened Buffer aliases, and preflattened_buffer_map
will be removed.