Triton Kernel
Cross-source consensus on the Triton kernel, drawn from 1 source and 5 claims.
Highlighted claims
- The Triton kernel has a lower dispatch floor of roughly 0.040 ms in the isolated attention benchmarks. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- At 50% pruning on DeiT-Base, the Triton pipeline is 2.04x to 2.24x faster than padded SDPA depending on batch size. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton ragged pipeline produces monotonic throughput improvements as pruning increases. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton kernel is not superior when compute dominates, as shown by its loss to FlashAttention-2 at batch size 64 without pruning. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The Triton implementation benefits short pruned ViT attention mainly because its JIT launch path is lighter than the FlashAttention-2 varlen API path. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
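The claims above fit a simple cost model: a ragged kernel only computes over surviving tokens, but a fixed dispatch floor (~0.040 ms per the source's isolated benchmarks) caps the achievable speedup when the remaining compute is small. The sketch below is illustrative only; the per-token cost, batch size, and pruned lengths are assumptions, not measurements from the paper.

```python
# Toy cost model for padded vs. ragged attention under token pruning.
# The ~0.040 ms dispatch floor comes from the source's isolated attention
# benchmarks; every other number here is an illustrative assumption.

def padded_attention_cost(batch, max_len, per_token_us=0.01, dispatch_ms=0.040):
    # Padded SDPA processes every sequence at the full, unpruned length.
    compute_ms = batch * max_len * per_token_us / 1000.0
    return dispatch_ms + compute_ms

def ragged_attention_cost(seq_lens, per_token_us=0.01, dispatch_ms=0.040):
    # A ragged kernel touches only the tokens that survive pruning.
    compute_ms = sum(seq_lens) * per_token_us / 1000.0
    return dispatch_ms + compute_ms

# DeiT-Base attends over 197 tokens; 50% pruning keeps roughly half.
batch, full_len = 8, 197
pruned_lens = [full_len // 2] * batch  # ~98 tokens each (illustrative)

padded = padded_attention_cost(batch, full_len)
ragged = ragged_attention_cost(pruned_lens)
print(f"padded: {padded:.3f} ms  ragged: {ragged:.3f} ms  "
      f"speedup: {padded / ragged:.2f}x")
```

Two of the highlighted claims fall out of this model directly: ragged cost decreases monotonically as pruning deepens, and once per-kernel overhead dominates the total, a lighter launch path (the Triton JIT path vs. the FlashAttention-2 varlen API path) matters more than raw compute throughput.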