Autoregressive Transformers
Cross-source consensus on Autoregressive Transformers, drawn from 1 source and 5 claims.
Highlighted claims
- Autoregressive generalization was tested with GPT2-style transformers on the same parity datasets used for the diffusion (DiT) experiments. — The two clocks and the innovation window: When and how generative models learn rules
- GPT models reproduced the two-clock structure of rule learning followed by memorization. — The two clocks and the innovation window: When and how generative models learn rules
- GPT models learned and memorized faster than matched DiT models, leading to a shorter innovation window (see the two-clock sketch below). — The two clocks and the innovation window: When and how generative models learn rules
- GPT rule learning was concentrated at the last bit of each parity group (see the parity-data sketch below). — The two clocks and the innovation window: When and how generative models learn rules
- GPT weight decay delayed memorization more strongly than DiT weight decay in the optimization ablations (see the weight-decay sketch below). — The two clocks and the innovation window: When and how generative models learn rules
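The claims above reference parity datasets without spelling out their construction, so here is a minimal sketch assuming one common parity setup: each sequence is split into fixed-length groups, and each group's final bit is the XOR (parity) of the bits before it. All names and parameters (`make_parity_dataset`, `group_len`, the group count) are illustrative, not taken from the paper; the sketch only shows why, under this construction, the last bit of each group is the single rule-determined position an autoregressive model can learn.

```python
import random

def make_parity_dataset(n_seqs=1000, n_groups=4, group_len=8, seed=0):
    """Toy parity data: within each group, the first group_len - 1
    bits are random and the final bit is their XOR (parity).
    Only that last position is rule-determined, which is one way
    to see why rule learning would concentrate there."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n_seqs):
        seq = []
        for _ in range(n_groups):
            bits = [rng.randint(0, 1) for _ in range(group_len - 1)]
            bits.append(sum(bits) % 2)  # parity bit closes the group
            seq.extend(bits)
        seqs.append(seq)
    return seqs

def rule_accuracy(seq, n_groups=4, group_len=8):
    """Fraction of groups whose final bit satisfies the parity rule,
    regardless of what the free bits are."""
    correct = 0
    for g in range(n_groups):
        group = seq[g * group_len:(g + 1) * group_len]
        correct += int(group[-1] == sum(group[:-1]) % 2)
    return correct / n_groups
```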
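To make the two-clock picture concrete, a minimal sketch of how an innovation window could be read off two training-time curves: rule accuracy (do generated sequences satisfy the parity rule?) and memorization fraction (do generated sequences exactly copy training strings?). The thresholds and the example curves below are assumptions for illustration, not the paper's measurements.

```python
def innovation_window(rule_acc, mem_frac, rule_thr=0.9, mem_thr=0.5):
    """Return (rule_step, mem_step): the first training steps at which
    each clock crosses its threshold. The gap between them is a simple
    stand-in for the innovation window."""
    rule_step = next((t for t, a in enumerate(rule_acc) if a >= rule_thr), None)
    mem_step = next((t for t, m in enumerate(mem_frac) if m >= mem_thr), None)
    return rule_step, mem_step

# Made-up curves for a fast learner: the rule clock fires at step 2,
# the memorization clock at step 3, so the window spans one step.
gpt_rule = [0.1, 0.5, 0.95, 0.99]
gpt_mem = [0.0, 0.1, 0.3, 0.6]
r, m = innovation_window(gpt_rule, gpt_mem)
print(f"rule clock at step {r}, memorization clock at step {m}")
```

A model that both learns and memorizes faster, as the GPT models reportedly did relative to DiT, moves both crossings earlier and can shrink the gap between them.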
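For the weight-decay ablation, the only moving part on the optimization side would be the decoupled decay coefficient of the optimizer. A minimal PyTorch sketch, with a placeholder model and illustrative decay values (the paper's actual models, learning rate, and decay grid are not given here):

```python
import torch
import torch.nn as nn

# Placeholder stand-in; the paper's matched GPT / DiT models would go here.
model = nn.Linear(16, 2)

# Sweep the decoupled weight-decay coefficient and, per run, record when
# memorization sets in (values below are illustrative, not the paper's grid).
for wd in (0.0, 0.01, 0.1):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=wd)
    print(f"weight_decay = {opt.defaults['weight_decay']}")
```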