Latency
Cross-source consensus on Latency from 1 source and 5 claims.
Highlighted claims
- Measured wall-clock speedup was lower than the speedup implied by the layer reduction alone, especially at larger batch sizes. — LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
- On an NVIDIA L4 at batch size 1, early exit reduced latency from 8.46 ms to 5.25 ms. — LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
- The speedup diminished at batch size 32 because GPU parallelism already amortized the per-layer cost (see the timing sketch after this list). — LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
- LEAP is positioned as most useful for latency-sensitive embedding services that plan to use early exit. — LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
- Throughput-oriented offline workloads may prefer a shallow student model over LEAP early exit. — LEAP: Layer-wise Exit-Aware Pretraining for Efficient Transformer Inference
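As a rough illustration of these claims, the sketch below pairs a fixed-layer early-exit encoder with a small timing harness and compares batch sizes 1 and 32. This is a minimal sketch assuming a plain PyTorch stack of `TransformerEncoderLayer` blocks; `EarlyExitEncoder`, `exit_layer`, and `time_forward` are illustrative names, not the LEAP paper's implementation, and measured numbers will differ from the NVIDIA L4 figures quoted above.

```python
import time
import torch
import torch.nn as nn

class EarlyExitEncoder(nn.Module):
    """Transformer encoder that can stop after a chosen number of layers."""
    def __init__(self, d_model=384, n_heads=6, n_layers=12):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )

    def forward(self, x, exit_layer=None):
        # Run only the first `exit_layer` blocks; None runs the full stack.
        depth = len(self.layers) if exit_layer is None else exit_layer
        for layer in self.layers[:depth]:
            x = layer(x)
        return x.mean(dim=1)  # mean-pooled embedding

@torch.no_grad()
def time_forward(model, batch_size, seq_len=128, d_model=384,
                 exit_layer=None, iters=50):
    device = next(model.parameters()).device
    x = torch.randn(batch_size, seq_len, d_model, device=device)
    for _ in range(5):                     # warmup iterations
        model(x, exit_layer)
    if device.type == "cuda":
        torch.cuda.synchronize()           # flush queued kernels before timing
    start = time.perf_counter()
    for _ in range(iters):
        model(x, exit_layer)
    if device.type == "cuda":
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per forward pass

device = "cuda" if torch.cuda.is_available() else "cpu"
model = EarlyExitEncoder().to(device).eval()
for bs in (1, 32):
    full = time_forward(model, bs)
    early = time_forward(model, bs, exit_layer=6)
    print(f"batch={bs:3d}  full={full:6.2f} ms  "
          f"exit@6={early:6.2f} ms  speedup={full/early:.2f}x")
```

On a GPU, the batch-1 speedup is typically closer to the 12-to-6 layer ratio than the batch-32 speedup, consistent with the amortization claim above; the gap between the two runs is the point of the comparison, not the absolute milliseconds.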