Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021
DOI: 10.18653/v1/2021.acl-long.418

Enabling Lightweight Fine-tuning for Pre-trained Language Model Compression based on Matrix Product Operators

Abstract: This paper presents a novel pre-trained language model (PLM) compression approach based on the matrix product operator (MPO for short) from quantum many-body physics. It can decompose an original matrix into central tensors (containing the core information) and auxiliary tensors (with only a small proportion of parameters). With the decomposed MPO structure, we propose a novel fine-tuning strategy by only updating the parameters from the auxiliary tensors, and design an optimization algorithm for MPO-based app…
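
As a rough illustration of the idea in the abstract, the sketch below factors a dense weight matrix into a chain of local tensors via sequential truncated SVDs, in the spirit of an MPO/tensor-train decomposition: the middle tensor plays the role of the central tensor and the outer ones the auxiliary tensors. The three-core split, the shapes, and the rank cap are assumptions chosen for illustration, not the paper's actual configuration.

```python
# Minimal sketch of an MPO-style factorization of a weight matrix via
# sequential truncated SVDs. The 3-core split, shapes, and rank cap are
# illustrative assumptions, not the paper's settings.
import numpy as np

def mpo_decompose(W, in_shape, out_shape, max_rank=64):
    """Factor W of shape (prod(in_shape), prod(out_shape)) into local tensors
    T_k of shape (r_{k-1}, i_k, j_k, r_k); rank truncation makes it approximate."""
    n = len(in_shape)
    # Reshape to (i1..in, j1..jn), then interleave so each (i_k, j_k) pair is adjacent.
    T = W.reshape(*in_shape, *out_shape)
    T = T.transpose([p for k in range(n) for p in (k, k + n)])
    cores, r_prev = [], 1
    for k in range(n - 1):
        mat = T.reshape(r_prev * in_shape[k] * out_shape[k], -1)
        U, S, Vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(S))
        cores.append(U[:, :r].reshape(r_prev, in_shape[k], out_shape[k], r))
        T = S[:r, None] * Vt[:r]          # carry the remainder to the next split
        r_prev = r
    cores.append(T.reshape(r_prev, in_shape[-1], out_shape[-1], 1))
    return cores

# Example: a 768 x 3072 feed-forward weight, split into 3 local tensors.
W = np.random.randn(768, 3072)
cores = mpo_decompose(W, in_shape=(4, 12, 16), out_shape=(8, 16, 24))
central, auxiliary = cores[1], [cores[0], cores[2]]
print([c.shape for c in cores])
print("trainable if only auxiliary tensors are updated:",
      sum(a.size for a in auxiliary), "of", W.size)
```

Freezing the large central tensor and updating only the two small auxiliary tensors is what makes the fine-tuning lightweight in this sketch: only a few tens of thousands of the roughly 2.4M original parameters remain trainable.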

Cited by 6 publications (6 citation statements). References 25 publications.

“…An example of such clever data compression schemes based on TN and MPO decomposition has already been introduced in the previous subsection. Using MPO as an efficient representation for weight matrices of a NN was originally suggested in the ML community under the name tensor trains [44] and later reintroduced in other contexts for systematic compression of fully connected NN models [34,45], for solving partial differential equations with NNs [31] and for language models [46] and speech processing [47].…”
Section: B Tensorizing Standard Neural Network
confidence: 99%
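
To make the compression mentioned in this statement concrete, here is a back-of-the-envelope parameter count for one fully connected layer represented as an MPO/tensor-train chain; the factorization of the input/output dimensions and the bond dimension D are assumptions chosen only for illustration.

```python
# Back-of-the-envelope parameter count: dense layer vs. MPO/tensor-train chain.
# The index factorization and the bond dimension D are illustrative assumptions.
def mpo_param_counts(in_shape, out_shape, bond_dim):
    counts, r_prev = [], 1
    for k, (i, j) in enumerate(zip(in_shape, out_shape)):
        r_next = 1 if k == len(in_shape) - 1 else bond_dim
        counts.append(r_prev * i * j * r_next)   # core T_k has shape (r_{k-1}, i, j, r_k)
        r_prev = r_next
    return counts

in_shape, out_shape, D = (4, 12, 16), (8, 16, 24), 64   # a 768 -> 3072 layer
dense = 768 * 3072
cores = mpo_param_counts(in_shape, out_shape, D)
print(f"dense: {dense}, MPO total: {sum(cores)}, per core: {cores}")
```

With these assumed shapes the chain stores roughly a third of the dense layer's parameters, and the savings grow as the bond dimension is reduced.
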
“…tensor-train operators (Oseledets, 2011) were proposed for a more effective representation of the linear structure of neural networks, which was used to compress deep neural networks (Novikov et al., 2015), convolutional neural networks (Garipov et al., 2016; Yu et al., 2017), and LSTMs (Gao et al., 2020b; Sun et al., 2020a). Based on MPO decomposition, recent studies designed lightweight fine-tuning and compression methods for PLMs (Liu et al., 2021), developed a parameter-efficient MoE architecture (Gao et al., 2022), over-parameterized PLMs, and empirically studied the emergent abilities of quantized large language models. Unlike these works, our work aims to develop a very deep PLM with a lightweight architecture and stable training.…”
Section: Related Work
confidence: 99%
“…Second, it should not affect the capacity to capture layer-specific variations. To achieve this, we utilize the MPO decomposition (Liu et al., 2021) to develop a parameter-efficient architecture by sharing informative components across layers and keeping layer-specific supplementary components (Section 3.2). As another potential issue, it is difficult to optimize deep PLMs due to unstable training (Wang et al., 2022b), especially when weight sharing (Lan et al., 2019) is involved.…”
Section: Overview Of Our Approach
confidence: 99%
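
The cross-layer sharing described in this excerpt can be pictured with a small sketch: one central tensor is shared by every layer, while each layer keeps its own much smaller auxiliary tensors. All names and shapes below are hypothetical and only meant to convey the parameter bookkeeping, not the cited architecture.

```python
# Hypothetical sketch of sharing a central tensor across layers while keeping
# layer-specific auxiliary tensors; names and shapes are assumptions.
import numpy as np

rng = np.random.default_rng(0)
num_layers, D = 12, 64

# One central tensor (the informative component), created once and tied.
shared_central = rng.standard_normal((D, 12, 16, D))

layers = [
    {
        "central": shared_central,                        # same array object in every layer
        "aux_left": rng.standard_normal((1, 4, 8, D)),    # layer-specific supplements
        "aux_right": rng.standard_normal((D, 16, 24, 1)),
    }
    for _ in range(num_layers)
]

layer_specific = sum(l["aux_left"].size + l["aux_right"].size for l in layers)
total = layer_specific + shared_central.size
print(f"layer-specific params: {layer_specific}, total params: {total}")
```

Because the central tensor is stored once rather than per layer, stacking more layers adds only the small auxiliary tensors, which is the parameter-efficiency argument the excerpt makes.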