Triton Kernel
Cross-source consensus on the Triton kernel, drawn from 1 source and 5 claims.
Highlighted claims
- The Triton kernel has a lower dispatch floor of roughly 0.040 ms in the isolated attention benchmarks. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- At 50% pruning on DeiT-Base, the Triton pipeline is 2.04x to 2.24x faster than padded SDPA depending on batch size. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton ragged pipeline produces monotonic throughput improvements as pruning increases. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton kernel is not superior when compute dominates, as shown by its loss to FlashAttention-2 at batch size 64 without pruning. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton implementation benefits short pruned ViT attention mainly because its JIT launch path is lighter than the FlashAttention-2 varlen API path. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
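The claims above fit a simple cost model: a ragged kernel only computes over surviving tokens, but a fixed dispatch floor (~0.040 ms per the source's isolated benchmarks) caps the achievable speedup when the remaining compute is small. The sketch below is illustrative only; the per-token cost, batch size, and pruned lengths are assumptions, not measurements from the paper.

```python
# Toy cost model for padded vs. ragged attention under token pruning.
# The ~0.040 ms dispatch floor comes from the source's isolated attention
# benchmarks; every other number here is an illustrative assumption.

def padded_attention_cost(batch, max_len, per_token_us=0.01, dispatch_ms=0.040):
    # Padded SDPA processes every sequence at the full, unpruned length.
    compute_ms = batch * max_len * per_token_us / 1000.0
    return dispatch_ms + compute_ms

def ragged_attention_cost(seq_lens, per_token_us=0.01, dispatch_ms=0.040):
    # A ragged kernel touches only the tokens that survive pruning.
    compute_ms = sum(seq_lens) * per_token_us / 1000.0
    return dispatch_ms + compute_ms

# DeiT-Base attends over 197 tokens; 50% pruning keeps roughly half.
batch, full_len = 8, 197
pruned_lens = [full_len // 2] * batch  # ~98 tokens each (illustrative)

padded = padded_attention_cost(batch, full_len)
ragged = ragged_attention_cost(pruned_lens)
print(f"padded: {padded:.3f} ms  ragged: {ragged:.3f} ms  "
      f"speedup: {padded / ragged:.2f}x")
```

Two of the highlighted claims fall out of this model directly: ragged cost decreases monotonically as pruning deepens, and once per-kernel overhead dominates the total, a lighter launch path (the Triton JIT path vs. the FlashAttention-2 varlen API path) matters more than raw compute throughput.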