“…The quadratic nature of attention mechanisms has led to a substantial surge in memory consumption for LLMs, magnifying the importance of effective memory management [8, 16, 58]. Researchers have proposed various algorithmic optimizations aimed at curbing memory consumption, including quantization techniques [22, 24-26, 32, 46, 47, 81, 94], pruning strategies [19-21, 27, 39, 52, 64, 65, 80, 93], KV-cache compression approaches [5, 37], compilation [9, 34, 90-92], and scheduling [11, 23, 50, 51, 54, 85].…”
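The memory pressure referenced above is easy to quantify with a back-of-the-envelope estimate. The following minimal Python sketch, using illustrative model dimensions that are assumptions rather than figures from the cited works, computes the KV-cache footprint of a decoder-only LLM and the reduction a quantized cache would give.

```python
# Sketch: rough KV-cache memory estimate for a decoder-only LLM, and the
# savings from quantizing cached keys/values to fewer bits per element.
# All model-shape parameters below are illustrative assumptions.

def kv_cache_bytes(num_layers: int, num_heads: int, head_dim: int,
                   seq_len: int, batch_size: int, bytes_per_elem: float) -> float:
    """Keys + values: 2 tensors per layer of shape [batch, heads, seq_len, head_dim]."""
    return 2 * num_layers * num_heads * head_dim * seq_len * batch_size * bytes_per_elem


if __name__ == "__main__":
    # Hypothetical 7B-class configuration: 32 layers, 32 heads, head_dim 128.
    cfg = dict(num_layers=32, num_heads=32, head_dim=128, batch_size=8)

    fp16 = kv_cache_bytes(**cfg, seq_len=4096, bytes_per_elem=2)    # fp16: 2 bytes/element
    int4 = kv_cache_bytes(**cfg, seq_len=4096, bytes_per_elem=0.5)  # 4-bit quantized cache

    print(f"fp16 KV cache:  {fp16 / 2**30:.1f} GiB")  # ~16 GiB at batch 8, 4k context
    print(f"4-bit KV cache: {int4 / 2**30:.1f} GiB")  # ~4 GiB, a 4x reduction
```

Even at a modest batch size and context length, the cache reaches tens of gibibytes, which is why the quantization, compression, and scheduling lines of work cited above target it directly.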