CATS
Cross-source consensus on CATS from 1 source and 5 claims.
Highlighted claims
- CATS is a self-speculative decoding framework designed for memory-limited LLM inference with parameter offloading. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- CATS avoids using a separate auxiliary draft model and keeps peak device memory equal to the target model alone. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- CATS partitions the target transformer into a draft model, a shallow verifier, and remaining target layers. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- CATS improves accepted length and wall-clock speed under edge-device constraints. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- CATS can be configured by choosing the draft-model depth (LDM) and shallow-verifier depth (LSV) to match available memory. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
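The claims above describe partitioning one target transformer into a draft model, a shallow verifier, and the remaining target layers, sized by the depths LDM and LSV. A minimal sketch of such a split, assuming the draft and verifier segments are taken from the front of the layer stack (the function name, layer representation, and front-of-stack ordering are illustrative assumptions, not details from the paper):

```python
def partition_layers(layers, l_dm, l_sv):
    """Split a target model's layer stack into (draft, shallow_verifier,
    remaining) segments of depth l_dm, l_sv, and the rest.

    Hypothetical helper: CATS's actual partitioning scheme may differ.
    """
    if l_dm + l_sv > len(layers):
        raise ValueError("LDM + LSV exceeds the target model's depth")
    draft = layers[:l_dm]                    # acts as the draft model
    verifier = layers[l_dm:l_dm + l_sv]      # shallow verifier
    remaining = layers[l_dm + l_sv:]         # remaining target layers
    return draft, verifier, remaining

# Example: a 32-layer target model with LDM=4 and LSV=8.
stack = [f"layer_{i}" for i in range(32)]
draft, verifier, remaining = partition_layers(stack, l_dm=4, l_sv=8)
print(len(draft), len(verifier), len(remaining))  # 4 8 20
```

Because all three segments are slices of the same model, peak device memory stays at the target model alone, consistent with the no-auxiliary-draft-model claim above; shrinking or growing LDM and LSV is how the configuration is matched to available memory.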