Decoding Speed
Cross-source consensus on Decoding Speed from 1 source and 4 claims.
Highlighted claims
- N-vium improves decoding latency by shifting work off the sequential critical path rather than reducing total per-token computation. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- The 1.5B Quadrivium model achieved a 57.9% wall-clock speedup while slightly improving perplexity relative to a dense baseline. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- Across fixed-width depth scaling, speedup generally increased with model depth, except the 6-layer model was slower. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- Across width scaling at 24 layers, reported speedups ranged from 23.8% to 53.8% without meaningful perplexity degradation. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
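
The claims above report speedups as percentages without stating how the percentage is defined. As a rough sketch only, the Python below converts a reported percentage into decoding time relative to the dense baseline under two common readings: a throughput gain (tokens per second rise by that percentage) or a time reduction (wall-clock time falls by that percentage). Both function names are hypothetical, and which reading the source intends is an assumption the reader should check against the paper.

```python
def relative_latency_from_throughput_gain(speedup_pct: float) -> float:
    """Decoding time relative to baseline, ASSUMING the percentage
    is a throughput gain (e.g. 50% means 1.5x tokens per second)."""
    return 1.0 / (1.0 + speedup_pct / 100.0)


def relative_latency_from_time_reduction(speedup_pct: float) -> float:
    """Decoding time relative to baseline, ASSUMING the percentage
    is a direct reduction in wall-clock time."""
    return 1.0 - speedup_pct / 100.0


if __name__ == "__main__":
    # Percentages quoted in the highlighted claims.
    for pct in (23.8, 53.8, 57.9):
        by_throughput = relative_latency_from_throughput_gain(pct)
        by_time = relative_latency_from_time_reduction(pct)
        print(f"{pct:5.1f}%  throughput reading: {by_throughput:.3f}x time"
              f"  |  time-reduction reading: {by_time:.3f}x time")
```

The two readings diverge more as the percentage grows, so the definition matters most for the largest reported figures.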