LLM Decode
Cross-source consensus on LLM Decode, drawn from 1 source and 4 claims.
Highlighted claims
All claims are drawn from a single source: "The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures".
- Roofline analysis placed decode kernels for all tested architectures deep in the memory-bound region (see the roofline sketch below).
- LLM decode is dominated by sequential matrix-vector work and by memory traffic from weights, the KV cache, or state.
- Tensor cores were mostly idle during decode because time was spent loading data from HBM.
- Batching improves decode energy efficiency by amortizing weight loads, but it does not remove decode's low arithmetic intensity (see the batching sketch below).
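
A back-of-the-envelope roofline check makes the memory-bound placement concrete. The sketch below is illustrative and not taken from the paper: the hardware numbers (roughly A100-class peak FLOP/s and HBM bandwidth) and the projection dimensions are assumptions.

```python
# Illustrative roofline check for one decode-step matrix-vector product.
# All hardware numbers and layer sizes below are assumed for illustration.

PEAK_FLOPS = 312e12       # assumed FP16 tensor-core peak, FLOP/s (A100-class)
HBM_BW     = 2.0e12       # assumed HBM bandwidth, bytes/s
BYTES_PER_PARAM = 2       # FP16 weights

d_in, d_out = 8192, 8192  # assumed projection dimensions

flops = 2 * d_in * d_out                       # one multiply-accumulate per weight
bytes_moved = d_in * d_out * BYTES_PER_PARAM   # weight matrix dominates traffic

intensity = flops / bytes_moved    # arithmetic intensity, FLOP per byte
ridge = PEAK_FLOPS / HBM_BW        # machine balance (roofline ridge point)

print(f"arithmetic intensity: {intensity:.1f} FLOP/B")
print(f"ridge point:          {ridge:.1f} FLOP/B")
print("memory-bound" if intensity < ridge else "compute-bound")
# With FP16 weights the matvec intensity is ~1 FLOP/B, far below a ridge
# point of ~156 FLOP/B, so decode sits deep in the memory-bound region.
```

Since the intensity sits two orders of magnitude below the ridge point, the tensor cores spend most of the step waiting on HBM, which is consistent with the idle-tensor-core claim above.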
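The batching claim can be sketched the same way: with batch size B, the weights are read once per step but reused B times, so arithmetic intensity grows roughly linearly in B while remaining below the ridge point at realistic decode batch sizes. Again, all numbers are assumed, and KV-cache and activation traffic are ignored for simplicity.

```python
# Batched decode with the same assumed sizes as above: B token vectors share
# one load of the weight matrix, so intensity scales ~linearly with B.
d_in, d_out, BYTES_PER_PARAM = 8192, 8192, 2  # assumed, as above
RIDGE = 156.0  # assumed machine balance from the sketch above, FLOP/B

for B in (1, 8, 32, 128):
    flops = 2 * d_in * d_out * B                      # B matvecs per step
    bytes_moved = d_in * d_out * BYTES_PER_PARAM      # weights read once per step
    intensity = flops / bytes_moved
    bound = "memory" if intensity < RIDGE else "compute"
    print(f"B={B:4d}  intensity = {intensity:6.1f} FLOP/B  ({bound}-bound)")
# Batching raises intensity from ~1 to ~B FLOP/B, but a decode step stays
# memory-bound until B approaches the ridge point: weight loads are
# amortized, yet the low per-token arithmetic intensity remains.
```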