“…However, as the size and complexity of LLMs rapidly increase [29,30,31], this conventional approach becomes impractical and costly, prompting the need for retraining-free compression techniques. Recent developments in this area have primarily centered on quantization [32,33,34] and have expanded to include pruning methods [13,15,14] that eliminate the need for retraining. In this paper, we target improving the performance of the retraining-free pruning paradigm, which simultaneously reduces model size, lowers memory consumption, and accelerates inference, while remaining orthogonal to and compatible with quantization for further compression.…”