Vision Transformer Token Pruning
Cross-source consensus on Vision Transformer Token Pruning from 1 source and 5 claims.
Highlighted claims
- At 80% pruning on DeiT-Base, attention FLOPs fall by about 96% because attention cost scales quadratically with token count (see the FLOP arithmetic sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Padding variable-length pruned batches can prevent theoretical FLOP reductions from becoming actual latency reductions on GPUs (see the padding sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Token pruning methods reduce theoretical attention cost by removing less informative image patch tokens after early transformer layers (see the token-selection sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Padded PyTorch execution is reported to be slower than unpruned inference across the pruning ratios evaluated. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The study argues that pruning speedups in current ViT pipelines may come more from reduced MLP work than from reduced attention work. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
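Token-selection sketch. The claim about removing less informative patch tokens corresponds to the following minimal sketch. It assumes one common scoring heuristic, ranking patch tokens by an importance score such as the CLS token's averaged attention weights and keeping the top-k after an early block; the paper's own selection rule may differ, and the function name `prune_tokens` and all shapes here are illustrative.

```python
import torch

def prune_tokens(x: torch.Tensor, cls_attn: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Keep the CLS token plus the top-scoring patch tokens.

    x:        (B, 1 + N, D) token embeddings, CLS token first.
    cls_attn: (B, N) importance scores for the patch tokens, e.g. the CLS
              token's attention weights averaged over heads (an assumed
              heuristic; the paper's scoring rule may differ).
    """
    n_patches = x.shape[1] - 1
    d = x.shape[-1]
    n_keep = max(1, int(n_patches * keep_ratio))

    # Indices of the n_keep highest-scoring patch tokens per image.
    keep_idx = cls_attn.topk(n_keep, dim=1).indices            # (B, n_keep)

    patches = x[:, 1:, :]                                       # (B, N, D)
    kept = torch.gather(patches, 1, keep_idx.unsqueeze(-1).expand(-1, -1, d))
    # Re-attach the CLS token; later blocks now run on 1 + n_keep tokens.
    return torch.cat([x[:, :1, :], kept], dim=1)

# Example: prune 80% of the 196 DeiT-Base patch tokens after an early block.
x = torch.randn(4, 197, 768)
cls_attn = torch.rand(4, 196)
print(prune_tokens(x, cls_attn, keep_ratio=0.2).shape)  # torch.Size([4, 40, 768])
```

All transformer blocks after the pruning point then operate on the shorter sequence, which is where the theoretical FLOP savings come from.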
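FLOP arithmetic sketch. The ~96% figure and the MLP-versus-attention argument both follow from simple FLOP arithmetic: the attention score and value matmuls scale with the square of the token count, while the QKV/output projections and the MLP scale only linearly. The sketch below assumes standard DeiT-Base dimensions (196 patch tokens, embedding dim 768, MLP ratio 4) and is a back-of-envelope model, not the paper's exact accounting.

```python
def block_flops(num_tokens: int, dim: int = 768, mlp_ratio: int = 4):
    """Rough per-block FLOP model: return (quadratic attention core, linear rest)."""
    n, d = num_tokens, dim
    attn_core = 2 * n * n * d               # QK^T scores and attn @ V: quadratic in n
    attn_proj = 4 * n * d * d               # Q, K, V and output projections: linear in n
    mlp = 2 * n * d * (mlp_ratio * d)       # two MLP matmuls: linear in n
    return attn_core, attn_proj + mlp

full_core, full_linear = block_flops(196)
for prune in (0.0, 0.5, 0.8):
    kept = int(round(196 * (1 - prune)))
    core, linear = block_flops(kept)
    print(f"prune {prune:.0%}: kept tokens {kept:3d}, "
          f"attention-core FLOPs {core / full_core:6.1%} of full, "
          f"linear/MLP FLOPs {linear / full_linear:6.1%} of full")
```

At 80% pruning (39 of 196 tokens kept) the quadratic term drops to roughly (39/196)^2 ≈ 4% of its original cost, i.e. a ~96% reduction, while the linear projection and MLP work only falls to about 20%. Because the linear terms dominate total FLOPs at these dimensions, most of the absolute reduction comes from the MLP side, which is in line with the study's argument above.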
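Padding sketch. The padding claims can be illustrated with a small cost comparison. If pruning leaves different token counts per image, a rectangular PyTorch batch has to be padded to a common length (here assumed to be the batch maximum), so the attention kernels still process near-worst-case sequence lengths plus masking overhead. The per-image counts below are hypothetical; the paper measures real GPU latencies rather than this kind of proxy.

```python
def padded_vs_ragged_cost(token_counts, dim=768):
    """Compare quadratic attention cost of a padded batch against the sum of
    per-image ("ragged") costs. Purely illustrative; real latency also depends
    on kernel dispatch and masking overhead."""
    padded_len = max(token_counts)
    padded = len(token_counts) * padded_len ** 2 * dim   # every image padded to the max
    ragged = sum(n ** 2 * dim for n in token_counts)     # each image at its own length
    return padded, ragged

# Hypothetical batch where pruning keeps different counts per image.
counts = [24, 39, 57, 196]   # one image keeps all 196 tokens
padded, ragged = padded_vs_ragged_cost(counts)
print(f"padded cost is {padded / ragged:.1f}x the ragged cost")  # ~3.5x
```

Once masking, gathering, and padding bookkeeping are added on top of this inflated cost, padded execution can end up slower than simply running the unpruned model, which is what the reported results indicate.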