[Relax][Frontend][KVCache] Extend masked sequence prefill to causal left-padding (#19431)

This PR extends `_attention_sequence_prefill_with_mask` to support a second mask regime for decoder-style embedding workloads.

### Summary

- Keep the existing right-padded bidirectional behavior as `mask_mode="padded"`.
- Add `mask_mode="causal_padded_left"` for left-padded causal sequence prefill.
- Add a `softmax_update_causal_padded_left` macro for the online softmax mask.
- Add tests for causal left-padding with zero, full, mixed, and GQA valid lengths.

### Motivation

This is a TVM-side kernel dependency for the first-class embedding serving work tracked in mlc-ai/mlc-llm#3451.

The existing masked sequence prefill kernel supports encoder-style batches where real tokens occupy the valid prefix `[0, valid_len)` and padding is on the right. Decoder-style embedding batches, such as the decoder-only embedding path, commonly left-pad variable-length inputs so the final real token / EOS lands at the same final column across the batch. This allows last-token pooling to read `output[:, -1, :]`, while still requiring causal masking within each valid suffix.

For each batch row:

- `mask_mode="padded"`: real tokens are `[0, valid_len)`.
- `mask_mode="causal_padded_left"`: real tokens are `[seq_len - valid_len, seq_len)`, with `col <= row`.

### Testing

- `git diff --check`
- Attempted: `python -m pytest -q tests/python/relax/test_frontend_nn_llm_sequence_prefill_masked.py -k 'causal_padded_left or valid_len_mixed'`
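The two mask regimes described in the commit message above can be illustrated with a small NumPy reference. This is only a sketch of the stated index ranges, not the TVM kernel or the `softmax_update_causal_padded_left` macro: the helper name `build_mask` is hypothetical, and it assumes the mask restricts key columns only, since the message does not say what the kernel writes for padded query rows.

```python
import numpy as np


def build_mask(seq_len: int, valid_len: int, mask_mode: str = "padded") -> np.ndarray:
    """Hypothetical reference mask for one batch row.

    Returns a boolean (seq_len, seq_len) array where entry [row, col] is True
    when query position `row` may attend to key position `col`.
    """
    rows = np.arange(seq_len)[:, None]  # query positions, shape (seq_len, 1)
    cols = np.arange(seq_len)[None, :]  # key positions, shape (1, seq_len)

    if mask_mode == "padded":
        # Right padding, bidirectional: real keys occupy [0, valid_len).
        # Assumption: only key columns are restricted in this sketch.
        return np.broadcast_to(cols < valid_len, (seq_len, seq_len)).copy()

    if mask_mode == "causal_padded_left":
        # Left padding, causal: real keys occupy [seq_len - valid_len, seq_len),
        # and a query may only see keys with col <= row.
        start = seq_len - valid_len
        return (cols >= start) & (cols <= rows)

    raise ValueError(f"unknown mask_mode: {mask_mode!r}")


if __name__ == "__main__":
    # Two real tokens in a length-4 row, left-padded.
    print(build_mask(4, 2, "causal_padded_left").astype(int))
```

Under these assumptions, the last row of the `causal_padded_left` mask always covers the whole valid suffix, which is why last-token pooling can read the final position of every batch row regardless of its `valid_len`.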
Apache TVM is an open machine learning compilation framework built on the following principles:
TVM is licensed under the Apache-2.0 license.
Check out the TVM Documentation site for installation instructions, tutorials, examples, and more. The Getting Started with TVM tutorial is a great place to start.
TVM adopts the Apache committer model. We aim to create an open-source project maintained and owned by the community. Check out the Contributor Guide.
TVM started as a research project for deep learning compilation. The first version of the project benefited a lot from the following projects:
Since then, the project has gone through several rounds of redesigns. The current design is also drastically different from the initial design, following the development trend of the ML compiler community.
The most recent version focuses on a cross-level design, with TensorIR as the tensor-level representation, Relax as the graph-level representation, and Python-first transformations. The project's current design goal is to make the ML compiler accessible by enabling most transformations to be customizable in Python and by providing a cross-level representation that can jointly optimize computational graphs, tensor programs, and libraries. The project also serves as foundational infrastructure for building Python-first vertical compilers for specific domains, such as LLMs.