Request-Level Energy
Cross-source consensus on Request-Level Energy from 1 source and 5 claims.
Highlighted claims
All five claims are drawn from a single source: "The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures".
- Novel attention replacements can spend more energy during prefill but recover it later through more efficient decode (see the break-even sketch after this list).
- Mamba2 achieved constant per-step decode latency regardless of context length, and a large energy advantage over GQA at large batch sizes and long contexts (see the fixed-state decode sketch below).
- GDN and Mamba2 used roughly an order of magnitude more prefill energy per token than the transformer baselines in the tested vLLM setup.
- At short context, MLA was worse than GQA-ctrl because weight loading dominated HBM traffic and decompressing the latent KV cache added overhead.
- At a production-like batch size of 32, MLA became the cheapest architecture from nearly the first output token, owing to its decode efficiency (see the HBM-traffic sketch below).
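The prefill-versus-decode recovery claim implies a simple break-even point: the extra prefill energy is a fixed premium, and the cheaper decode repays it per output token. Below is a minimal sketch of that arithmetic; the per-token energy values are hypothetical placeholders, not the paper's measurements.

```python
# Break-even model for request-level energy, assuming per-token energy
# is constant within each phase (a simplification). The numbers below
# are hypothetical placeholders, not measurements from the paper.

def request_energy(e_prefill: float, e_decode: float,
                   n_in: int, n_out: int) -> float:
    """Total request energy (J): prefill J/token * input tokens
    plus decode J/token * output tokens."""
    return e_prefill * n_in + e_decode * n_out

def break_even_output_len(e_prefill_a: float, e_decode_a: float,
                          e_prefill_b: float, e_decode_b: float,
                          n_in: int) -> float:
    """Output length at which architecture B's cheaper decode has
    repaid its prefill premium over architecture A (requires
    e_decode_b < e_decode_a)."""
    prefill_premium = (e_prefill_b - e_prefill_a) * n_in
    per_token_saving = e_decode_a - e_decode_b
    return prefill_premium / per_token_saving

# B pays ~10x prefill energy per token (the ratio reported for
# GDN/Mamba2) but decodes at half the per-token energy -- the halving
# is a hypothetical ratio chosen only for illustration.
n_out_star = break_even_output_len(
    e_prefill_a=0.02, e_decode_a=0.10,   # transformer baseline, J/token
    e_prefill_b=0.20, e_decode_b=0.05,   # novel architecture, J/token
    n_in=1024,
)
print(f"break-even at ~{n_out_star:.0f} output tokens")  # ~3686
```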
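Mamba2's flat per-step decode latency follows from its fixed-size recurrent state: each decode step touches the same amount of memory regardless of how many tokens precede it, whereas attention must read a KV cache that grows with context. The numpy sketch below contrasts the two step costs; the recurrence is an illustrative diagonal linear update, not the actual Mamba2 rule or the paper's kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_state = 64, 16

# Attention-style decode step: work grows with context length T,
# because the new query must read the entire cached history.
def attention_step(q, k_cache, v_cache):       # caches have shape (T, d)
    scores = k_cache @ q                       # O(T * d) reads and FLOPs
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                         # O(T * d) again

# SSM-style decode step: constant work, because the whole history is
# summarised in a fixed-size state h (illustrative recurrence only).
def ssm_step(h, x, a, B, C):                   # h, a, B, C: (d_state,)
    h = a * h + B * x                          # O(d_state) state update
    return h, C @ h                            # O(d_state) readout

# Doubling the context doubles attention's per-step traffic but leaves
# the SSM step untouched -- the source of the flat decode latency.
for T in (1_000, 2_000):
    q = rng.standard_normal(d)
    k, v = rng.standard_normal((T, d)), rng.standard_normal((T, d))
    attention_step(q, k, v)                    # touches ~2*T*d floats
h = np.zeros(d_state)
a = 0.9 * np.ones(d_state)
B, C = rng.standard_normal(d_state), rng.standard_normal(d_state)
h, y = ssm_step(h, 1.0, a, B, C)               # touches O(d_state) floats
```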
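Both MLA observations reduce to what dominates HBM traffic in a decode step: weight reads happen once per step and are amortised over the batch, while KV-cache reads scale with context length. The back-of-the-envelope model below reproduces the qualitative pattern; the model size, cache layouts, latent rank, and decompression overhead are all illustrative assumptions, not the paper's configurations.

```python
# Per-token HBM traffic (bytes) for one decode step: weights are read
# once per step and amortised across the batch; KV reads scale with
# context. All sizes here are illustrative assumptions.

BYTES = 2  # fp16

def traffic_per_token(weight_bytes, kv_bytes_per_ctx_token, batch, context):
    return weight_bytes / batch + kv_bytes_per_ctx_token * context

n_layers = 32
weight_bytes = 7e9 * BYTES                # ~7B-parameter model in fp16

# GQA: cache full K and V for 8 KV heads of dim 128 (illustrative).
gqa_kv = 2 * 8 * 128 * n_layers * BYTES
# MLA: cache one small latent per token (rank 512, illustrative), but
# read extra decompression (up-projection) weights each step.
mla_kv = 512 * n_layers * BYTES
mla_weights = weight_bytes * 1.05         # ~5% extra, assumed

for batch, context in [(1, 256), (32, 8192)]:
    gqa = traffic_per_token(weight_bytes, gqa_kv, batch, context)
    mla = traffic_per_token(mla_weights, mla_kv, batch, context)
    print(f"batch={batch:>2} ctx={context:>5}: "
          f"GQA {gqa / 1e9:.2f} GB/token, MLA {mla / 1e9:.2f} GB/token")

# batch 1, short context: weight loading dominates, and MLA's extra
# decompression weights make it slightly worse than GQA.
# batch 32, long context: weights amortise, KV reads dominate, and
# MLA's small latent cache makes it the cheapest almost immediately.
```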