Ragged Attention
Cross-source consensus on Ragged Attention from 1 source and 5 claims.
Highlighted claims
- The ragged attention kernel implements the FlashAttention-2 online softmax algorithm, specialized for bidirectional ViT inference (a plain-PyTorch sketch of the recurrence follows this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The proposed ragged attention kernel is written in Triton and is part of a three-component system with token packing and end-to-end DeiT integration. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Ragged attention uses packed surviving tokens and cumulative sequence lengths to represent variable-length pruned batches (see the packing sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- With 39 surviving tokens per image, a single 64×64 query-key tile pair covers the entire per-image, per-head attention computation. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Variable-length execution is presented as necessary for making pruning meaningful in practice, and it pays off only when dispatch overhead is kept low. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
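
The packed-token claim describes the standard variable-length ("varlen") layout: surviving tokens from all images are concatenated into one flat tensor, and a cumulative-sequence-length vector records where each image's tokens start and end. Below is a minimal PyTorch sketch of that packing step, assuming a boolean keep mask; the function name, shapes, and the illustrative keep rate are assumptions for this example, not the paper's implementation.

```python
import torch

def pack_surviving_tokens(x, keep_mask):
    """Pack surviving tokens from a padded batch into one flat tensor.

    x:         (B, N, D) token embeddings, B images of N tokens each
    keep_mask: (B, N) boolean mask of tokens that survived pruning

    Returns:
      packed:     (total_kept, D) surviving tokens, concatenated image by image
      cu_seqlens: (B + 1,) int32 cumulative sequence lengths; image i's tokens
                  occupy packed[cu_seqlens[i]:cu_seqlens[i + 1]]
    """
    seqlens = keep_mask.sum(dim=1)                       # kept tokens per image
    cu_seqlens = torch.zeros(x.shape[0] + 1, dtype=torch.int32, device=x.device)
    cu_seqlens[1:] = torch.cumsum(seqlens, dim=0)
    packed = x[keep_mask]                                # boolean indexing flattens (B, N) -> total_kept
    return packed, cu_seqlens


# Illustrative usage with DeiT-S-like shapes; the ~20% keep rate is chosen
# only so that roughly 39 tokens per image survive, matching the claims.
x = torch.randn(2, 197, 384)
keep_mask = torch.rand(2, 197) < 0.2
packed, cu_seqlens = pack_surviving_tokens(x, keep_mask)
```

This is the same cu_seqlens convention used by varlen FlashAttention-style kernels: one kernel launch can then serve every image in the batch even though their token counts differ.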
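
The first claim says the kernel implements the FlashAttention-2 online-softmax recurrence over this packed layout. The sketch below is a plain-PyTorch, single-head reference of that recurrence (a running row maximum, a running denominator, and a rescaled output accumulator per key tile), restricted to each image's slice via cu_seqlens. It is not the paper's Triton kernel; the block size and function name are assumptions for illustration.

```python
import torch

def ragged_attention_reference(q, k, v, cu_seqlens, block_k=64):
    """Single-head reference of ragged attention over packed tokens.

    q, k, v:    (total_tokens, D) packed surviving tokens
    cu_seqlens: (B + 1,) cumulative sequence lengths from the packing step

    Each image attends only within its own slice of the packed tensors,
    using the FlashAttention-2 online-softmax recurrence over key tiles.
    """
    scale = q.shape[-1] ** -0.5
    out = torch.empty_like(q)
    for i in range(cu_seqlens.numel() - 1):
        s, e = int(cu_seqlens[i]), int(cu_seqlens[i + 1])
        qi = q[s:e] * scale
        m = torch.full((e - s, 1), float("-inf"), dtype=q.dtype, device=q.device)
        l = torch.zeros(e - s, 1, dtype=q.dtype, device=q.device)
        acc = torch.zeros(e - s, q.shape[-1], dtype=q.dtype, device=q.device)
        # With roughly 39 surviving tokens per image and block_k = 64, this
        # inner loop runs exactly once: one key tile covers the whole image,
        # which is the single 64x64 tile pair noted in the claims.
        for ks in range(s, e, block_k):
            ke = min(ks + block_k, e)
            scores = qi @ k[ks:ke].T                              # (n_i, tile)
            m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
            p = torch.exp(scores - m_new)                         # tile softmax numerator
            alpha = torch.exp(m - m_new)                          # rescale factor for old stats
            l = alpha * l + p.sum(dim=-1, keepdim=True)
            acc = alpha * acc + p @ v[ks:ke]
            m = m_new
        out[s:e] = acc / l                                        # normalize once per image
    return out
```

In use, q, k, and v would be per-head projections of the packed tokens from the previous sketch, and the same cu_seqlens would drive both the packing and the attention dispatch.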