DeiT Inference
Cross-source consensus on DeiT Inference from 1 source and 5 claims.
Highlighted claims
- The end-to-end pipeline runs layers 1 through 4 unpruned, applies pruning, packs the surviving tokens, and runs layers 5 through 12 with ragged attention and MLP computation (see the packing sketch after this list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- The system is pruning-method agnostic as long as the pruning method outputs a per-token keep mask. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- Model-scale experiments show throughput gains across DeiT-Tiny, DeiT-Small, and DeiT-Base. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- End-to-end speedups are limited because layers 1 through 4 and the MLP computation do not fully benefit from token reduction. — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
- At 90% pruning, the measured 2.13x speedup remains below the 2.66x theoretical ceiling because not all model components benefit equally from token reduction (a worked form of this bound follows the list). — Dispatch-Aware Ragged Attention for Pruned Vision Transformers
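The pipeline and keep-mask claims above can be made concrete with a small sketch. The code below is an illustrative PyTorch rendering, not the paper's dispatch-aware kernel: the `pack_tokens` and `ragged_self_attention` names, the toy shapes, and the per-image attention loop are assumptions for exposition. It shows the interface the claims describe: any pruning method that emits a per-token boolean keep mask can feed the packing step, after which the remaining layers operate only on surviving tokens.

```python
# Minimal sketch (not the paper's kernel) of: dense layers 1-4 -> per-token
# keep mask -> pack survivors -> ragged layers 5-12. Names and shapes are
# illustrative assumptions.
import torch
import torch.nn.functional as F

def pack_tokens(x, keep):
    """Pack surviving tokens from (B, N, D) into a flat (T, D) tensor.

    Returns packed tokens plus per-image kept-token counts, so the ragged
    layers know where each image's tokens start and end.
    """
    lengths = keep.sum(dim=1)   # (B,) number of tokens kept per image
    packed = x[keep]            # (T, D) with T = lengths.sum()
    return packed, lengths

def ragged_self_attention(packed, lengths, num_heads=4):
    """Toy ragged attention: each image attends only over its own tokens."""
    out, start = torch.empty_like(packed), 0
    for n in lengths.tolist():
        if n == 0:
            continue
        seq = packed[start:start + n].unsqueeze(0)        # (1, n, D)
        d = seq.shape[-1]
        h = seq.view(1, n, num_heads, d // num_heads).transpose(1, 2)
        a = F.scaled_dot_product_attention(h, h, h)       # self-attention
        out[start:start + n] = a.transpose(1, 2).reshape(n, d)
        start += n
    return out

B, N, D = 2, 196, 64                  # toy batch, token count, embed dim
x = torch.randn(B, N, D)              # activations after dense layers 1-4
keep = torch.rand(B, N) > 0.9         # ~90% pruning: keep roughly 10% of tokens
packed, lengths = pack_tokens(x, keep)
y = ragged_self_attention(packed, lengths)   # stands in for layers 5-12
print(packed.shape, lengths.tolist(), y.shape)
```

Because the pipeline only consumes the boolean mask, it stays pruning-method agnostic: swapping the random `keep` above for any learned or heuristic scorer changes nothing downstream.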
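The gap between the 2.13x measurement and the 2.66x ceiling is an Amdahl's-law effect: layers 1 through 4 run at full cost regardless of how aggressively layers 5 through 12 are pruned. The sketch below computes such a ceiling under an assumed cost model (attention cost quadratic in token count, MLP cost linear, and an illustrative 40% attention share per layer); the cost split is not taken from the paper, so the printed value is only an example of the bound's shape.

```python
# Amdahl-style ceiling on end-to-end speedup when only layers 5-12 shrink.
# The attention/MLP cost split is an assumed illustration, not the paper's
# measured profile.
def speedup_ceiling(keep_ratio, dense_layers=4, ragged_layers=8, attn_share=0.4):
    dense = dense_layers                           # layers 1-4: full cost
    per_layer = (attn_share * keep_ratio**2        # attention ~ quadratic in tokens
                 + (1 - attn_share) * keep_ratio)  # MLP ~ linear in tokens
    ragged = ragged_layers * per_layer             # layers 5-12 after pruning
    return (dense_layers + ragged_layers) / (dense + ragged)

print(f"{speedup_ceiling(keep_ratio=0.10):.2f}x")  # keep 10% of tokens (90% pruned)
```

With these illustrative constants the bound happens to land near the paper's 2.66x figure, but the actual ceiling depends on the model's measured attention/MLP cost split; either way, the unpruned dense layers keep the achievable speedup strictly below the per-token reduction would suggest.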