Attention Architectures
Cross-source consensus on Attention Architectures, drawn from 1 source and 5 claims.
Highlighted claims
- The study compares GQA, MLA, GDN, and Mamba2 as distinct attention or attention-replacement paradigms. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- MLA uses a compressed latent cache instead of the larger GQA cache in the controlled Minitron-4B comparison. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- GQA and GQA-ctrl remain memory-bound across batch sizes, allowing one low decode clock to work broadly. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- GDN tolerates aggressive underclocking because its decode path is extremely compute-light. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
- MLA and Mamba2 require higher optimal clocks at larger batches because additional per-step work matters more. — The Illusion of Power Capping in LLM Decode: A Phase-Aware Energy Characterisation Across Attention Architectures
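The cache-size claim can be made concrete with a small back-of-the-envelope sketch. The dimensions below (layer count, KV heads, head and latent widths) are illustrative assumptions, not the Minitron-4B configuration from the study; the point is only that caching one compressed latent per layer, as MLA does, takes far fewer bytes per decoded token than caching full per-head K and V vectors, as GQA does.

```python
# Minimal sketch of why an MLA-style compressed latent cache is smaller than a
# GQA KV cache per decoded token. All dimensions below are illustrative
# assumptions, not the Minitron-4B configuration used in the paper.

def gqa_cache_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    # GQA stores full K and V vectors for each KV head in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

def mla_cache_bytes_per_token(n_layers, d_latent, d_rope=0, bytes_per_elem=2):
    # MLA caches one compressed latent per layer (plus a small decoupled
    # positional component), instead of per-head K/V vectors.
    return n_layers * (d_latent + d_rope) * bytes_per_elem

if __name__ == "__main__":
    # Hypothetical 4B-scale settings, assumed for illustration only.
    gqa = gqa_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
    mla = mla_cache_bytes_per_token(n_layers=32, d_latent=512, d_rope=64)
    print(f"GQA cache: {gqa / 1024:.1f} KiB per token")
    print(f"MLA cache: {mla / 1024:.1f} KiB per token")
    print(f"MLA is ~{gqa / mla:.1f}x smaller per token")
```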
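The memory-bound behaviour behind the GQA clock claim can likewise be sketched with a roofline-style estimate: a decode step streams the model weights plus each sequence's cache but performs only a few FLOPs per byte, so arithmetic intensity stays well below the GPU's ridge point across batch sizes. The GPU peak, bandwidth, model size, and cache size below are assumed values for illustration, not measurements from the paper.

```python
# Rough roofline-style check of why attention decode tends to stay memory-bound.
# Hardware and model numbers are assumed (roughly A100-class, 4B parameters),
# not taken from the paper.

PEAK_TFLOPS = 312.0   # assumed fp16 tensor-core peak (TFLOP/s)
MEM_BW_GBPS = 2039.0  # assumed HBM bandwidth (GB/s)
RIDGE = PEAK_TFLOPS * 1e12 / (MEM_BW_GBPS * 1e9)  # FLOPs per byte at the ridge point

def decode_arithmetic_intensity(params_b, cache_bytes_per_seq, batch, bytes_per_param=2):
    """FLOPs per byte for one decode step at a given batch size."""
    params = params_b * 1e9
    flops = 2 * params * batch                               # weights reused across the batch
    bytes_moved = params * bytes_per_param + batch * cache_bytes_per_seq
    return flops / bytes_moved

for batch in (1, 8, 64, 256):
    # Assume ~256 MiB of KV cache per sequence (e.g. a long context under GQA).
    ai = decode_arithmetic_intensity(params_b=4, cache_bytes_per_seq=256 * 2**20, batch=batch)
    bound = "memory-bound" if ai < RIDGE else "compute-bound"
    print(f"batch {batch:>3}: {ai:6.1f} FLOP/byte vs ridge {RIDGE:.0f} -> {bound}")
```

Under these assumed numbers every batch size stays far below the ridge point, which is consistent with the claim that a single low decode clock can serve GQA broadly, while architectures whose per-step work grows faster with batch (MLA, Mamba2) would shift toward needing higher clocks.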