LLM Decode
Cross-source consensus on LLM Decode, drawn from 1 source and 4 claims.
Highlighted claims
All claims are drawn from a single source: "The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures".
- Roofline analysis placed decode kernels for all tested architectures deep in the memory-bound region (see the roofline sketch below).
- LLM decode is dominated by sequential matrix-vector work and by memory traffic from weights, the KV cache, or state.
- Tensor cores were mostly idle during decode because time was spent loading data from HBM.
- Batching improves decode energy efficiency by amortizing weight loads, but it does not remove decode's low arithmetic intensity (see the batching sketch below).
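
A back-of-the-envelope roofline check makes the memory-bound placement concrete. The sketch below is illustrative and not taken from the paper: the hardware numbers (roughly A100-class peak FLOP/s and HBM bandwidth) and the projection dimensions are assumptions.

```python
# Illustrative roofline check for one decode-step matrix-vector product.
# All hardware numbers and layer sizes below are assumed for illustration.

PEAK_FLOPS = 312e12       # assumed FP16 tensor-core peak, FLOP/s (A100-class)
HBM_BW     = 2.0e12       # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2       # FP16 weights

d_in, d_out = 8192, 8192  # assumed projection dimensions

flops = 2 * d_in * d_out                       # one multiply-accumulate per weight
bytes_moved = d_in * d_out * BYTES_PER_PARAM   # weight matrix dominates traffic

intensity = flops / bytes_moved    # arithmetic intensity, FLOP per byte
ridge = PEAK_FLOPS / HBM_BW        # machine balance (roofline ridge point)

print(f"arithmetic intensity: {intensity:.1f} FLOP/B")
print(f"ridge point:          {ridge:.1f} FLOP/B")
print("memory-bound" if intensity < ridge else "compute-bound")
# With FP16 weights the matvec intensity is ~1 FLOP/B, far below a ridge
# point of ~156 FLOP/B, so decode sits deep in the memory-bound region.
```

Since the intensity sits two orders of magnitude below the ridge point, the tensor cores spend most of the step waiting on HBM, which is consistent with the idle-tensor-core claim above.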
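The batching claim can be sketched the same way: with batch size B, the weights are read once per step but reused B times, so arithmetic intensity grows roughly linearly in B while remaining below the ridge point at realistic decode batch sizes. Again, all numbers are assumed, and KV-cache and activation traffic are ignored for simplicity.

```python
# Batched decode with the same assumed sizes as above: B token vectors share
# one load of the weight matrix, so intensity scales ~linearly with B.
d_in, d_out, BYTES_PER_PARAM = 8192, 8192, 2  # assumed, as above
RIDGE = 156.0  # assumed machine balance from the sketch above, FLOP/B

for B in (1, 8, 32, 128):
    flops = 2 * d_in * d_out * B                      # B matvecs per step
    bytes_moved = d_in * d_out * BYTES_PER_PARAM      # weights read once per step
    intensity = flops / bytes_moved
    bound = "memory" if intensity < RIDGE else "compute"
    print(f"B={B:4d}  intensity = {intensity:6.1f} FLOP/B  ({bound}-bound)")
# Batching raises intensity from ~1 to ~B FLOP/B, but a decode step stays
# memory-bound until B approaches the ridge point: weight loads are
# amortized, yet the low per-token arithmetic intensity remains.
```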