Memory-Limited Edge Inference
Cross-source consensus on Memory-Limited Edge Inference, drawn from 1 source and 5 claims.
Highlighted claims
- On edge devices, the dominant bottleneck shifts from HBM-to-SRAM transfer to flash-to-DRAM transfer. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Standard autoregressive LLM decoding becomes memory-bound because each generated token requires reloading the model's weights and intermediate state; the back-of-the-envelope latency sketch after this list shows the scale. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Profiling Vicuna-7B on edge hardware shows that flash transfer dominates per-token latency. — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Jetson AGX Orin has too little unified DRAM to keep a 7B model fully resident, requiring chunked streaming of weights from flash (see the second sketch below). — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
- Edge speculative decoding should optimize flash-to-DRAM movement rather than only maximizing the accepted-token count (see the scoring example below). — CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
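
To see why flash transfer dominates, here is a back-of-the-envelope latency sketch in Python. The bandwidth figures are illustrative assumptions, not measurements from the paper: a 7B model in fp16 is roughly 14 GB, and if all of it must cross the slowest memory link once per token, that link's bandwidth alone bounds per-token latency.

    # Rough lower bound on per-token decode latency when every weight
    # byte must be re-read each step (memory-bound regime). Bandwidth
    # figures are illustrative assumptions, not numbers from the paper.

    PARAMS = 7e9          # Vicuna-7B parameter count
    BYTES_PER_PARAM = 2   # fp16 weights

    weight_bytes = PARAMS * BYTES_PER_PARAM  # ~14 GB per decode step

    for name, gbps in [("HBM (server GPU)", 2000),
                       ("LPDDR5 DRAM (edge)", 200),
                       ("NVMe flash (edge)", 3)]:
        latency_s = weight_bytes / (gbps * 1e9)
        print(f"{name:>20}: >= {latency_s * 1e3:8.1f} ms/token")

At an assumed 3 GB/s of flash bandwidth, the floor is several seconds per token, which is why the flash-to-DRAM link, not compute, sets edge decode speed.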
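
Next, a minimal sketch of chunked weight streaming, assuming per-layer weight files stored on flash; the file layout and the run_layer stand-in are hypothetical, not the paper's implementation.

    import numpy as np

    # Minimal sketch of chunked weight streaming: only one layer's
    # weights are DRAM-resident at a time, and each is re-read from
    # flash on every decode step. File layout is a hypothetical example.

    N_LAYERS = 32

    def load_layer(weight_dir, layer):
        # mmap_mode="r" leaves the file on flash; np.array(...) copies
        # the chunk into DRAM, i.e. the flash-to-DRAM transfer itself.
        return np.array(np.load(f"{weight_dir}/layer_{layer}.npy", mmap_mode="r"))

    def run_layer(weights, hidden):
        # Stand-in for a real transformer layer's forward pass.
        return np.tanh(weights @ hidden)

    def decode_step(hidden, weight_dir="weights"):
        for layer in range(N_LAYERS):
            w = load_layer(weight_dir, layer)  # flash -> DRAM
            hidden = run_layer(w, hidden)
            del w  # release DRAM before streaming the next chunk
        return hidden

Because every decode step repeats all N_LAYERS transfers, per-token latency is pinned to flash bandwidth exactly as in the calculation above.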
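
Finally, one way to make the last claim concrete is to score a speculation configuration by accepted tokens per second of flash transfer rather than by acceptance alone. The metric and all numbers below are illustrative assumptions, not CATS's actual objective.

    # Illustrative flash-aware metric: expected accepted tokens divided
    # by the time spent streaming weights from flash to verify them.
    # The metric and the numbers are illustrative assumptions.

    def tokens_per_flash_second(expected_accepted, bytes_streamed, flash_gbps=3.0):
        transfer_s = bytes_streamed / (flash_gbps * 1e9)
        return expected_accepted / transfer_s

    # A larger tree accepts more tokens but, here, costs two full weight
    # passes; the flash-aware metric still prefers the smaller tree.
    small = tokens_per_flash_second(expected_accepted=2.0, bytes_streamed=14e9)
    large = tokens_per_flash_second(expected_accepted=2.6, bytes_streamed=28e9)
    print(f"small tree: {small:.2f} tok/s | large tree: {large:.2f} tok/s")

Under these assumptions the smaller tree wins (about 0.43 vs 0.28 tokens per flash-second) even though it accepts fewer tokens per verification pass.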