Memory-Limited Edge Inference
Cross-source consensus on Memory-Limited Edge Inference, drawn from 1 source and 5 claims.
Highlighted claims
- On edge devices, the dominant bottleneck shifts from HBM-to-SRAM transfer to flash-to-DRAM transfer. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Standard autoregressive LLM decoding becomes memory-bound because each generated token requires reloading the model's weights and intermediate state; the back-of-the-envelope latency sketch after this list shows the scale. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Profiling Vicuna-7B on edge hardware shows that flash transfer dominates per-token latency. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Jetson AGX Orin has too little unified DRAM to keep a 7B model fully resident, requiring chunked streaming of weights from flash (see the second sketch below). — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Edge speculative decoding should optimize flash-to-DRAM movement rather than only maximizing the accepted-token count (see the scoring example below). — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
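
To see why flash transfer dominates, here is a back-of-the-envelope latency sketch in Python. The bandwidth figures are illustrative assumptions, not measurements from the paper: a 7B model in fp16 is roughly 14 GB, and if all of it must cross the slowest memory link once per token, that link's bandwidth alone bounds per-token latency.

    # Rough lower bound on per-token decode latency when every weight
    # byte must be re-read each step (memory-bound regime). Bandwidth
    # figures are illustrative assumptions, not numbers from the paper.

    PARAMS = 7e9          # Vicuna-7B parameter count
    BYTES_PER_PARAM = 2   # fp16 weights

    weight_bytes = PARAMS * BYTES_PER_PARAM  # ~14 GB per decode step

    for name, gbps in [("HBM (server GPU)", 2000),
                       ("LPDDR5 DRAM (edge)", 200),
                       ("NVMe flash (edge)", 3)]:
        latency_s = weight_bytes / (gbps * 1e9)
        print(f"{name:>20}: >= {latency_s * 1e3:8.1f} ms/token")

At an assumed 3 GB/s of flash bandwidth, the floor is several seconds per token, which is why the flash-to-DRAM link, not compute, sets edge decode speed.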
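
Next, a minimal sketch of chunked weight streaming, assuming per-layer weight files stored on flash; the file layout and the run_layer stand-in are hypothetical, not the paper's implementation.

    import numpy as np

    # Minimal sketch of chunked weight streaming: only one layer's
    # weights are DRAM-resident at a time, and each is re-read from
    # flash on every decode step. File layout is a hypothetical example.

    N_LAYERS = 32

    def load_layer(weight_dir, layer):
        # mmap_mode="r" leaves the file on flash; np.array(...) copies
        # the chunk into DRAM, i.e. the flash-to-DRAM transfer itself.
        return np.array(np.load(f"{weight_dir}/layer_{layer}.npy", mmap_mode="r"))

    def run_layer(weights, hidden):
        # Stand-in for a real transformer layer's forward pass.
        return np.tanh(weights @ hidden)

    def decode_step(hidden, weight_dir="weights"):
        for layer in range(N_LAYERS):
            w = load_layer(weight_dir, layer)  # flash -> DRAM
            hidden = run_layer(w, hidden)
            del w  # release DRAM before streaming the next chunk
        return hidden

Because every decode step repeats all N_LAYERS transfers, per-token latency is pinned to flash bandwidth exactly as in the calculation above.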
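
Finally, one way to make the last claim concrete is to score a speculation configuration by accepted tokens per second of flash transfer rather than by acceptance alone. The metric and all numbers below are illustrative assumptions, not CATS's actual objective.

    # Illustrative flash-aware metric: expected accepted tokens divided
    # by the time spent streaming weights from flash to verify them.
    # The metric and the numbers are illustrative assumptions.

    def tokens_per_flash_second(expected_accepted, bytes_streamed, flash_gbps=3.0):
        transfer_s = bytes_streamed / (flash_gbps * 1e9)
        return expected_accepted / transfer_s

    # A larger tree accepts more tokens but, here, costs two full weight
    # passes; the flash-aware metric still prefers the smaller tree.
    small = tokens_per_flash_second(expected_accepted=2.0, bytes_streamed=14e9)
    large = tokens_per_flash_second(expected_accepted=2.6, bytes_streamed=28e9)
    print(f"small tree: {small:.2f} tok/s | large tree: {large:.2f} tok/s")

Under these assumptions the smaller tree wins (about 0.43 vs 0.28 tokens per flash-second) even though it accepts fewer tokens per verification pass.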