Pretraining Mismatch
Cross-source consensus on Pretraining Mismatch from 1 source and 5 claims.
Highlighted claims
- Replacing dense attention inside a pretrained model changes its computation graph and can hurt performance. — Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
- OpenWebText GPT-2 fine-tuning from dense-attention pretrained weights favored dense fine-tuning over ChaCAL or Block-ChaCAL variants. — Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
- Direct Block-ChaCAL decoder-layer substitution underperformed standard dense fine-tuning on SCROLLS with BART-base. — Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
- When trained from scratch on WikiText-103, Block-ChaCAL outperformed dense attention and dense ChaCAL in perplexity. — Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
- Encoder-decoder models appeared especially sensitive to operator substitution because encoder representations and decoder attention dynamics are coupled under dense attention. — Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity
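
The first claim, that swapping the attention operator changes what a pretrained model computes, can be illustrated with a small sketch. It runs single-head attention twice over the same query, key, and value vectors: once with a dense mask and once with a block-local mask. The block-local mask is only a generic stand-in for a sparse operator and is an assumption of this example; it is not ChaCAL or Block-ChaCAL, whose definitions are not given here. The function names, sizes, and random inputs are likewise hypothetical.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores (may contain -inf)."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v, allowed):
    """Single-head attention; allowed[i][j] says whether query i may see key j."""
    d = len(q[0])
    out = []
    for i, qi in enumerate(q):
        scores = []
        for j, kj in enumerate(k):
            if allowed[i][j]:
                scores.append(sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d))
            else:
                scores.append(float("-inf"))  # masked keys get zero weight
        weights = softmax(scores)
        out.append(
            [sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))]
        )
    return out


def max_output_gap(n=8, d=4, block=2, seed=0):
    """Largest absolute difference between dense and block-local outputs.

    The same q, k, v (standing in for fixed pretrained activations) are fed
    to both operators, so any gap comes from the mask alone.
    """
    rng = random.Random(seed)
    q = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    k = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
    v = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]

    dense_mask = [[True] * n for _ in range(n)]
    # Block-local mask: position i attends only to positions in its own block.
    block_mask = [[i // block == j // block for j in range(n)] for i in range(n)]

    dense_out = attention(q, k, v, dense_mask)
    sparse_out = attention(q, k, v, block_mask)
    return max(
        abs(a - b)
        for row_d, row_s in zip(dense_out, sparse_out)
        for a, b in zip(row_d, row_s)
    )


if __name__ == "__main__":
    print("gap with block size 2:", max_output_gap(block=2))
    print("gap with block size 8 (same as dense):", max_output_gap(block=8))
```

When the block size equals the sequence length, the two masks coincide and the gap is zero. With smaller blocks the outputs diverge even though no weights changed, which is the kind of shift that layers trained under dense attention would then have to absorb during fine-tuning.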