“…tensor-train operators (Oseledets, 2011) were proposed as a more effective representation of the linear structure of neural networks, and have been used to compress deep neural networks (Novikov et al., 2015), convolutional neural networks (Garipov et al., 2016; Yu et al., 2017), and LSTMs (Gao et al., 2020b; Sun et al., 2020a). Based on MPO decomposition, recent studies have designed lightweight fine-tuning and compression methods for PLMs (Liu et al., 2021), developed a parameter-efficient MoE architecture (Gao et al., 2022), over-parameterized PLMs, and empirically studied the emergent abilities of quantized large language models. Unlike these works, our work aims to develop a very deep PLM with a lightweight architecture and stable training.…”
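To make the shared idea behind these works concrete, the sketch below shows how a matrix product operator (tensor-train) factorization can represent a linear layer's weight matrix as a chain of small core tensors via sequential truncated SVDs. It is a minimal illustration, not the exact procedure of any cited work; the function name `mpo_decompose`, the dimension splits, and the rank cap are illustrative assumptions.

```python
import numpy as np

def mpo_decompose(W, in_dims, out_dims, max_rank):
    """Illustrative MPO/TT-SVD: factor a weight matrix into a chain of small cores.

    W        : (prod(in_dims), prod(out_dims)) weight matrix
    in_dims  : factorization of the input dimension, e.g. (4, 8, 8, 4) for 1024
    out_dims : factorization of the output dimension
    max_rank : cap on the bond ranks, which controls the compression rate
    """
    d = len(in_dims)
    # Reshape to (i1, ..., id, j1, ..., jd), then interleave input/output modes
    T = W.reshape(*in_dims, *out_dims)
    perm = [k for pair in zip(range(d), range(d, 2 * d)) for k in pair]
    T = T.transpose(perm)  # shape (i1, j1, i2, j2, ..., id, jd)

    cores, rank = [], 1
    for k in range(d - 1):
        # Split off the k-th (input, output) mode pair with a truncated SVD
        T = T.reshape(rank * in_dims[k] * out_dims[k], -1)
        U, S, Vt = np.linalg.svd(T, full_matrices=False)
        new_rank = min(max_rank, len(S))
        cores.append(U[:, :new_rank].reshape(rank, in_dims[k], out_dims[k], new_rank))
        T = S[:new_rank, None] * Vt[:new_rank]  # carry the remainder forward
        rank = new_rank
    cores.append(T.reshape(rank, in_dims[-1], out_dims[-1], 1))
    return cores

# Example (illustrative sizes): factor a 1024 x 1024 layer into 4 cores
W = np.random.randn(1024, 1024)
cores = mpo_decompose(W, (4, 8, 8, 4), (4, 8, 8, 4), max_rank=16)
print([c.shape for c in cores])             # [(1,4,4,16), (16,8,8,16), (16,8,8,16), (16,4,4,1)]
print(sum(c.size for c in cores) / W.size)  # fraction of original parameters kept (lossy)
```

The compression comes from storing only the cores: with the illustrative sizes above, the cores hold roughly 3% of the original parameters, at the cost of a rank-truncation error that grows as `max_rank` shrinks.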