Request-Level Energy
Cross-source consensus on Request-Level Energy from 1 source and 5 claims.
Highlighted claims
All five claims are drawn from a single source: "The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures".
- Novel attention replacements can spend more energy during prefill but recover it later through more efficient decode (see the break-even sketch after this list).
- Mamba2 achieved constant per-step decode latency regardless of context length, and a large energy advantage over GQA at large batch sizes and long contexts (see the fixed-state decode sketch below).
- GDN and Mamba2 used roughly an order of magnitude more prefill energy per token than the transformer baselines in the tested vLLM setup.
- At short context, MLA was worse than GQA-ctrl because weight loading dominated HBM traffic and decompressing the latent KV cache added overhead.
- At a production-like batch size of 32, MLA became the cheapest architecture from nearly the first output token, owing to its decode efficiency (see the HBM-traffic sketch below).
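The prefill-versus-decode recovery claim implies a simple break-even point: the extra prefill energy is a fixed premium, and the cheaper decode repays it per output token. Below is a minimal sketch of that arithmetic; the per-token energy values are hypothetical placeholders, not the paper's measurements.

```python
# Break-even model for request-level energy, assuming per-token energy
# is constant within each phase (a simplification). The numbers below
# are hypothetical placeholders, not measurements from the paper.

def request_energy(e_prefill: float, e_decode: float,
                   n_in: int, n_out: int) -> float:
    """Total request energy (J): prefill J/token * input tokens
    plus decode J/token * output tokens."""
    return e_prefill * n_in + e_decode * n_out

def break_even_output_len(e_prefill_a: float, e_decode_a: float,
                          e_prefill_b: float, e_decode_b: float,
                          n_in: int) -> float:
    """Output length at which architecture B's cheaper decode has
    repaid its prefill premium over architecture A (requires
    e_decode_b < e_decode_a)."""
    prefill_premium = (e_prefill_b - e_prefill_a) * n_in
    per_token_saving = e_decode_a - e_decode_b
    return prefill_premium / per_token_saving

# B pays ~10x prefill energy per token (the ratio reported for
# GDN/Mamba2) but decodes at half the per-token energy -- the halving
# is a hypothetical ratio chosen only for illustration.
n_out_star = break_even_output_len(
    e_prefill_a=0.02, e_decode_a=0.10,   # transformer baseline, J/token
    e_prefill_b=0.20, e_decode_b=0.05,   # novel architecture, J/token
    n_in=1024,
)
print(f"break-even at ~{n_out_star:.0f} output tokens")  # ~3686
```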
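Mamba2's flat per-step decode latency follows from its fixed-size recurrent state: each decode step touches the same amount of memory regardless of how many tokens precede it, whereas attention must read a KV cache that grows with context. The numpy sketch below contrasts the two step costs; the recurrence is an illustrative diagonal linear update, not the actual Mamba2 rule or the paper's kernels.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_state = 64, 16

# Attention-style decode step: work grows with context length T,
# because the new query must read the entire cached history.
def attention_step(q, k_cache, v_cache):       # caches have shape (T, d)
    scores = k_cache @ q                       # O(T * d) reads and FLOPs
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v_cache                         # O(T * d) again

# SSM-style decode step: constant work, because the whole history is
# summarised in a fixed-size state h (illustrative recurrence only).
def ssm_step(h, x, a, B, C):                   # h, a, B, C: (d_state,)
    h = a * h + B * x                          # O(d_state) state update
    return h, C @ h                            # O(d_state) readout

# Doubling the context doubles attention's per-step traffic but leaves
# the SSM step untouched -- the source of the flat decode latency.
for T in (1_000, 2_000):
    q = rng.standard_normal(d)
    k, v = rng.standard_normal((T, d)), rng.standard_normal((T, d))
    attention_step(q, k, v)                    # touches ~2*T*d floats
h = np.zeros(d_state)
a = 0.9 * np.ones(d_state)
B, C = rng.standard_normal(d_state), rng.standard_normal(d_state)
h, y = ssm_step(h, 1.0, a, B, C)               # touches O(d_state) floats
```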
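Both MLA observations reduce to what dominates HBM traffic in a decode step: weight reads happen once per step and are amortised over the batch, while KV-cache reads scale with context length. The back-of-the-envelope model below reproduces the qualitative pattern; the model size, cache layouts, latent rank, and decompression overhead are all illustrative assumptions, not the paper's configurations.

```python
# Per-token HBM traffic (bytes) for one decode step: weights are read
# once per step and amortised across the batch; KV reads scale with
# context. All sizes here are illustrative assumptions.

BYTES = 2  # fp16

def traffic_per_token(weight_bytes, kv_bytes_per_ctx_token, batch, context):
    return weight_bytes / batch + kv_bytes_per_ctx_token * context

n_layers = 32
weight_bytes = 7e9 * BYTES                # ~7B-parameter model in fp16

# GQA: cache full K and V for 8 KV heads of dim 128 (illustrative).
gqa_kv = 2 * 8 * 128 * n_layers * BYTES
# MLA: cache one small latent per token (rank 512, illustrative), but
# read extra decompression (up-projection) weights each step.
mla_kv = 512 * n_layers * BYTES
mla_weights = weight_bytes * 1.05         # ~5% extra, assumed

for batch, context in [(1, 256), (32, 8192)]:
    gqa = traffic_per_token(weight_bytes, gqa_kv, batch, context)
    mla = traffic_per_token(mla_weights, mla_kv, batch, context)
    print(f"batch={batch:>2} ctx={context:>5}: "
          f"GQA {gqa / 1e9:.2f} GB/token, MLA {mla / 1e9:.2f} GB/token")

# batch 1, short context: weight loading dominates, and MLA's extra
# decompression weights make it slightly worse than GQA.
# batch 32, long context: weights amortise, KV reads dominate, and
# MLA's small latent cache makes it the cheapest almost immediately.
```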