Vision Transformer Token Pruning
Cross-source consensus on Vision Transformer Token Pruning from 1 source and 5 claims.
Highlighted claims
- At 80% pruning on DeiT-Base, attention FLOPs fall by about 96% because attention cost scales quadratically with token count (see the FLOP arithmetic sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Padding variable-length pruned batches can prevent theoretical FLOP reductions from becoming actual latency reductions on GPUs (see the padding sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Token pruning methods reduce theoretical attention cost by removing less informative image patch tokens after early transformer layers (see the token-selection sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Padded PyTorch execution is reported to be slower than unpruned inference across the pruning ratios evaluated. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The study argues that pruning speedups in current ViT pipelines may come more from reduced MLP work than from reduced attention work. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
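Token-selection sketch. The claim about removing less informative patch tokens corresponds to the following minimal sketch. It assumes one common scoring heuristic, ranking patch tokens by an importance score such as the CLS token's averaged attention weights and keeping the top-k after an early block; the paper's own selection rule may differ, and the function name `prune_tokens` and all shapes here are illustrative.

```python
import torch

def prune_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the CLS token plus the top-scoring patch tokens.

    x:        (B, 1 + N, D) token embeddings, CLS token first.
    cls_attn: (B, N) importance scores for the patch tokens, e.g. the CLS
              token's attention weights averaged over heads (an assumed
              heuristic; the paper's scoring rule may differ).
    """
    n_patches = x.shape[1] - 1
    d = x.shape[-1]
    n_keep = max(1, int(n_patches * keep_ratio))

    # Indices of the n_keep highest-scoring patch tokens per image.
    keep_idx = cls_attn.topk(n_keep, dim=1).indices            # (B, n_keep)

    patches = x[:, 1:, :]                                       # (B, N, D)
    kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    # Re-attach the CLS token; later blocks now run on 1 + n_keep tokens.
    return torch.cat([x[:, :1, :], kept], dim=1)

# Example: prune 80% of the 196 DeiT-Base patch tokens after an early block.
x = torch.randn(4, 197, 768)
cls_attn = torch.rand(4, 196)
print(prune_tokens(x, cls_attn, keep_ratio=0.2).shape)  # torch.Size([4, 40, 768])
```

All transformer blocks after the pruning point then operate on the shorter sequence, which is where the theoretical FLOP savings come from.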
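FLOP arithmetic sketch. The ~96% figure and the MLP-versus-attention argument both follow from simple FLOP arithmetic: the attention score and value matmuls scale with the square of the token count, while the QKV/output projections and the MLP scale only linearly. The sketch below assumes standard DeiT-Base dimensions (196 patch tokens, embedding dim 768, MLP ratio 4) and is a back-of-envelope model, not the paper's exact accounting.

```python
def block_flops(num_tokens: int, dim: int = 768, mlp_ratio: int = 4):
    """Rough per-block FLOP model: return (quadratic attention core, linear rest)."""
    n, d = num_tokens, dim
    attn_core = 2 * n * n * d               # QK^T scores and attn @ V: quadratic in n
    attn_proj = 4 * n * d * d               # Q, K, V and output projections: linear in n
    mlp = 2 * n * d * (mlp_ratio * d)       # two MLP matmuls: linear in n
    return attn_core, attn_proj + mlp

full_core, full_linear = block_flops(196)
for prune in (0.0, 0.5, 0.8):
    kept = int(round(196 * (1 - prune)))
    core, linear = block_flops(kept)
    print(f"prune {prune:.0%}: kept tokens {kept:3d}, "
          f"attention-core FLOPs {core / full_core:6.1%} of full, "
          f"linear/MLP FLOPs {linear / full_linear:6.1%} of full")
```

At 80% pruning (39 of 196 tokens kept) the quadratic term drops to roughly (39/196)^2 ≈ 4% of its original cost, i.e. a ~96% reduction, while the linear projection and MLP work only falls to about 20%. Because the linear terms dominate total FLOPs at these dimensions, most of the absolute reduction comes from the MLP side, which is in line with the study's argument above.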
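Padding sketch. The padding claims can be illustrated with a small cost comparison. If pruning leaves different token counts per image, a rectangular PyTorch batch has to be padded to a common length (here assumed to be the batch maximum), so the attention kernels still process near-worst-case sequence lengths plus masking overhead. The per-image counts below are hypothetical; the paper measures real GPU latencies rather than this kind of proxy.

```python
def padded_vs_ragged_cost(token_counts, dim=768):
    """Compare quadratic attention cost of a padded batch against the sum of
    per-image ("ragged") costs. Purely illustrative; real latency also depends
    on kernel dispatch and masking overhead."""
    padded_len = max(token_counts)
    padded = len(token_counts) * padded_len ** 2 * dim   # every image padded to the max
    ragged = sum(n ** 2 * dim for n in token_counts)     # each image at its own length
    return padded, ragged

# Hypothetical batch where pruning keeps different counts per image.
counts = [24, 39, 57, 196]   # one image keeps all 196 tokens
padded, ragged = padded_vs_ragged_cost(counts)
print(f"padded cost is {padded / ragged:.1f}x the ragged cost")  # ~3.5x
```

Once masking, gathering, and padding bookkeeping are added on top of this inflated cost, padded execution can end up slower than simply running the unpruned model, which is what the reported results indicate.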