Mixture Objective
Cross-source consensus on Mixture Objective from 1 source and 4 claims.
How it works
Highlighted claims
- Training optimizes the mixture distribution directly with cross-entropy. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- Router warmup is used early in training to prevent collapse by encouraging uniform expected exit use. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- The method adds an expected normalized depth penalty to encourage efficient exits. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
- With beta set to zero, N-vium reduces to next-token optimization over the mixture without a compute penalty. — N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation
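The four claims above can be read together as one per-token loss. The sketch below is a minimal illustration under stated assumptions, not the N-vium implementation: the source does not give the formulas, so the softmax router, the normalized depth of exit k as (k + 1) / K, and the KL-to-uniform form of the warmup term are all hypothetical choices, as are the names `mixture_loss` and `warmup_penalty`.

```python
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_softmax(logits):
    """Log-probabilities for a list of unnormalized scores."""
    z = log_sum_exp(logits)
    return [v - z for v in logits]


def mixture_loss(router_logits, exit_logits, target, beta=0.0):
    """Mixture cross-entropy plus beta times the expected normalized depth.

    router_logits: K scores, one per exit (assumed softmax router).
    exit_logits:   K lists of vocabulary logits, one list per exit.
    target:        index of the correct next token.
    beta:          weight of the depth penalty (0 disables it).
    """
    num_exits = len(router_logits)
    log_pi = log_softmax(router_logits)

    # log p_k(target) for each exit k.
    log_p_target = [log_softmax(logits)[target] for logits in exit_logits]

    # Cross-entropy on the mixture itself: -log sum_k pi_k * p_k(target).
    log_mix = log_sum_exp([a + b for a, b in zip(log_pi, log_p_target)])
    cross_entropy = -log_mix

    # Expected normalized depth, assuming exit k sits at depth (k + 1) / K.
    expected_depth = sum(
        math.exp(lp) * (k + 1) / num_exits for k, lp in enumerate(log_pi)
    )
    return cross_entropy + beta * expected_depth


def warmup_penalty(mean_exit_probs):
    """KL(mean exit usage || uniform); zero when all exits are used equally.

    One assumed way to encourage uniform expected exit use during warmup.
    """
    k = len(mean_exit_probs)
    return sum(p * math.log(p * k) for p in mean_exit_probs if p > 0)
```

With `beta=0.0` the second term vanishes and `mixture_loss` is plain next-token cross-entropy on the mixture, matching the last claim. In this sketch the warmup penalty would be added to the loss only for the early training steps, computed on router probabilities averaged over a batch.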