Autoregressive Transformers
Cross-source consensus on Autoregressive Transformers, drawn from 1 source and 5 claims.
Highlighted claims
- Autoregressive generalization was tested with GPT2-style transformers on the same parity datasets used for the diffusion (DiT) experiments. — The two clocks and the innovation window: When and how generative models learn rules
- GPT models reproduced the two-clock structure of rule learning followed by memorization. — The two clocks and the innovation window: When and how generative models learn rules
- GPT models learned and memorized faster than matched DiT models, leading to a shorter innovation window (see the two-clock sketch below). — The two clocks and the innovation window: When and how generative models learn rules
- GPT rule learning was concentrated at the last bit of each parity group (see the parity-data sketch below). — The two clocks and the innovation window: When and how generative models learn rules
- GPT weight decay delayed memorization more strongly than DiT weight decay in the optimization ablations (see the weight-decay sketch below). — The two clocks and the innovation window: When and how generative models learn rules
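The claims above reference parity datasets without spelling out their construction, so here is a minimal sketch assuming one common parity setup: each sequence is split into fixed-length groups, and each group's final bit is the XOR (parity) of the bits before it. All names and parameters (`make_parity_dataset`, `group_len`, the group count) are illustrative, not taken from the paper; the sketch only shows why, under this construction, the last bit of each group is the single rule-determined position an autoregressive model can learn.

```python
import random

def make_parity_dataset(n_seqs=1000, n_groups=4, group_len=8, seed=0):
    """Toy parity data: within each group, the first group_len - 1
    bits are random and the final bit is their XOR (parity).
    Only that last position is rule-determined, which is one way
    to see why rule learning would concentrate there."""
    rng = random.Random(seed)
    seqs = []
    for _ in range(n_seqs):
        seq = []
        for _ in range(n_groups):
            bits = [rng.randint(0, 1) for _ in range(group_len - 1)]
            bits.append(sum(bits) % 2)  # parity bit closes the group
            seq.extend(bits)
        seqs.append(seq)
    return seqs

def rule_accuracy(seq, n_groups=4, group_len=8):
    """Fraction of groups whose final bit satisfies the parity rule,
    regardless of what the free bits are."""
    correct = 0
    for g in range(n_groups):
        group = seq[g * group_len:(g + 1) * group_len]
        correct += int(group[-1] == sum(group[:-1]) % 2)
    return correct / n_groups
```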
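To make the two-clock picture concrete, a minimal sketch of how an innovation window could be read off two training-time curves: rule accuracy (do generated sequences satisfy the parity rule?) and memorization fraction (do generated sequences exactly copy training strings?). The thresholds and the example curves below are assumptions for illustration, not the paper's measurements.

```python
def innovation_window(rule_acc, mem_frac, rule_thr=0.9, mem_thr=0.5):
    """Return (rule_step, mem_step): the first training steps at which
    each clock crosses its threshold. The gap between them is a simple
    stand-in for the innovation window."""
    rule_step = next((t for t, a in enumerate(rule_acc) if a >= rule_thr), None)
    mem_step = next((t for t, m in enumerate(mem_frac) if m >= mem_thr), None)
    return rule_step, mem_step

# Made-up curves for a fast learner: the rule clock fires at step 2,
# the memorization clock at step 3, so the window spans one step.
gpt_rule = [0.1, 0.5, 0.95, 0.99]
gpt_mem = [0.0, 0.1, 0.3, 0.6]
r, m = innovation_window(gpt_rule, gpt_mem)
print(f"rule clock at step {r}, memorization clock at step {m}")
```

A model that both learns and memorizes faster, as the GPT models reportedly did relative to DiT, moves both crossings earlier and can shrink the gap between them.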
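For the weight-decay ablation, the only moving part on the optimization side would be the decoupled decay coefficient of the optimizer. A minimal PyTorch sketch, with a placeholder model and illustrative decay values (the paper's actual models, learning rate, and decay grid are not given here):

```python
import torch
import torch.nn as nn

# Placeholder stand-in; the paper's matched GPT / DiT models would go here.
model = nn.Linear(16, 2)

# Sweep the decoupled weight-decay coefficient and, per run, record when
# memorization sets in (values below are illustrative, not the paper's grid).
for wd in (0.0, 0.01, 0.1):
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=wd)
    print(f"weight_decay = {opt.defaults['weight_decay']}")
```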