Decoding Speed

Cross-source consensus on Decoding Speed from 1 source and 4 claims.

How it works

N-vium improves decoding latency by shifting work off the sequential critical path rather than reducing total per-token computation. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
The 1.5B Quadrivium model achieved a 57.9% wall-clock speedup while slightly improving perplexity relative to a dense baseline. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
Across fixed-width depth scaling, speedup generally increased with model depth, except the 6-layer model was slower. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
Across width scaling at 24 layers, reported speedups ranged from 23.8% to 53.8% without meaningful perplexity degradation. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
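
The speedup percentages above can be read in two ways, and the claims as summarized here do not say which convention the source uses. The sketch below is illustrative arithmetic only, not taken from the paper: the function names are hypothetical, and both readings are assumptions. Reading A treats "p% speedup" as a p% gain in throughput (tokens per second); reading B treats it as a p% reduction in wall-clock time.

```python
def throughput_ratio_reading_a(speedup_pct: float) -> float:
    """Reading A (assumed): throughput is (1 + p/100) times the baseline."""
    return 1.0 + speedup_pct / 100.0


def time_fraction_reading_a(speedup_pct: float) -> float:
    """Reading A: fraction of baseline wall-clock time that remains."""
    return 1.0 / throughput_ratio_reading_a(speedup_pct)


def throughput_ratio_reading_b(speedup_pct: float) -> float:
    """Reading B (assumed): wall-clock time drops by p%, so throughput
    rises by a factor of 1 / (1 - p/100)."""
    if not 0.0 <= speedup_pct < 100.0:
        raise ValueError("a time reduction must be in [0, 100) percent")
    return 1.0 / (1.0 - speedup_pct / 100.0)


# Figures quoted in the claims above: the 1.5B model and the width-scaling range.
for pct in (57.9, 23.8, 53.8):
    a = throughput_ratio_reading_a(pct)
    b = throughput_ratio_reading_b(pct)
    print(f"{pct}%: reading A = {a:.3f}x throughput, reading B = {b:.3f}x throughput")
```

For the 57.9% figure, reading A gives 1.579x throughput, while reading B gives roughly 2.4x, so the two conventions differ substantially at this magnitude. Check the source's definition before comparing these numbers with other reported speedups.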