Attention Architectures
Cross-source consensus on Attention Architectures, drawn from 1 source and 5 claims.
Highlighted claims
- The study compares GQA, MLA, GDN, and Mamba2 as distinct attention or attention-replacement paradigms. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- MLA uses a compressed latent cache instead of the larger GQA cache in the controlled Minitron-4B comparison. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- GQA and GQA-ctrl remain memory-bound across batch sizes, allowing one low decode clock to work broadly. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- GDN tolerates aggressive underclocking because its decode path is extremely compute-light. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- MLA and Mamba2 require higher optimal clocks at larger batches because additional per-step work matters more. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
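The cache-size claim can be made concrete with a small back-of-the-envelope sketch. The dimensions below (layer count, KV heads, head and latent widths) are illustrative assumptions, not the Minitron-4B configuration from the study; the point is only that caching one compressed latent per layer, as MLA does, takes far fewer bytes per decoded token than caching full per-head K and V vectors, as GQA does.

```python
# Minimal sketch of why an MLA-style compressed latent cache is smaller than a
# GQA KV cache per decoded token. All dimensions below are illustrative
# assumptions, not the Minitron-4B configuration used in the paper.

def gqa_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # GQA stores full K and V vectors for each KV head in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def mla_cache_bytes_per_token(n_layers, d_latent, d_rope=0, bytes_per_elem=2):
    # MLA caches one compressed latent per layer (plus a small decoupled
    # positional component), instead of per-head K/V vectors.
    return n_layers * (d_latent + d_rope) * bytes_per_elem

if __name__ == "__main__":
    # Hypothetical 4B-scale settings, assumed for illustration only.
    gqa = gqa_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    mla = mla_cache_bytes_per_token(n_layers=32, d_latent=512, d_rope=64)
    print(f"GQA cache: {gqa / 1024:.1f} KiB per token")
    print(f"MLA cache: {mla / 1024:.1f} KiB per token")
    print(f"MLA is ~{gqa / mla:.1f}x smaller per token")
```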
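The memory-bound behaviour behind the GQA clock claim can likewise be sketched with a roofline-style estimate: a decode step streams the model weights plus each sequence's cache but performs only a few FLOPs per byte, so arithmetic intensity stays well below the GPU's ridge point across batch sizes. The GPU peak, bandwidth, model size, and cache size below are assumed values for illustration, not measurements from the paper.

```python
# Rough roofline-style check of why attention decode tends to stay memory-bound.
# Hardware and model numbers are assumed (roughly A100-class, 4B parameters),
# not taken from the paper.

PEAK_TFLOPS = 312.0   # assumed fp16 tensor-core peak (TFLOP/s)
MEM_BW_GBPS = 2039.0  # assumed HBM bandwidth (GB/s)
RIDGE = PEAK_TFLOPS * 1e12 / (MEM_BW_GBPS * 1e9)  # FLOPs per byte at the ridge point

def decode_arithmetic_intensity(params_b, cache_bytes_per_seq, batch, bytes_per_param=2):
    """FLOPs per byte for one decode step at a given batch size."""
    params = params_b * 1e9
    flops = 2 * params * batch                               # weights reused across the batch
    bytes_moved = params * bytes_per_param + batch * cache_bytes_per_seq
    return flops / bytes_moved

for batch in (1, 8, 64, 256):
    # Assume ~256 MiB of KV cache per sequence (e.g. a long context under GQA).
    ai = decode_arithmetic_intensity(params_b=4, cache_bytes_per_seq=256 * 2**20, batch=batch)
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOP/byte vs ridge {RIDGE:.0f} -> {bound}")
```

Under these assumed numbers every batch size stays far below the ridge point, which is consistent with the claim that a single low decode clock can serve GQA broadly, while architectures whose per-step work grows faster with batch (MLA, Mamba2) would shift toward needing higher clocks.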