Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) 2022
DOI: 10.18653/v1/2022.acl-short.24

Kronecker Decomposition for GPT Compression

Abstract: GPT is an auto-regressive Transformer-based pre-trained language model that has attracted a lot of attention in the natural language processing (NLP) domain. The success of GPT is mostly attributed to its pre-training on a huge amount of data and its large number of parameters. Despite the superior performance of GPT, its overparameterized nature can be prohibitive for deploying the model on devices with limited computational power or memory. This problem can be mitigated using model compression t…

Cited by 10 publications (7 citation statements)
References 29 publications
“…DRONE achieves better performance than SVD. Besides, as an alternative to SVD, Kronecker decomposition retains the rank of the matrix and has shown improvements in compressing BERT and GPT-2 (Tahaei et al. 2021; Edalati et al. 2022).…”
Section: Low-Rank Factorization
confidence: 99%
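To make the contrast with SVD concrete, below is a minimal NumPy sketch of approximating a weight matrix by a single Kronecker product using Van Loan's rearrangement trick (the leading singular vectors of a reshaped matrix give the best Kronecker factors). The factor shapes, function name, and toy sizes are illustrative assumptions, not the exact procedure of Tahaei et al. (2021) or Edalati et al. (2022).

import numpy as np

def nearest_kronecker(W, shape_a, shape_b):
    """Approximate W (m1*m2 x n1*n2) by A (m1 x n1) Kronecker B (m2 x n2).

    Uses Van Loan's rearrangement: the best Kronecker factors come from
    the leading singular vectors of a reshaped W. Illustrative sketch only.
    """
    (m1, n1), (m2, n2) = shape_a, shape_b
    assert W.shape == (m1 * m2, n1 * n2)
    # Rearrange W so each row is one (m2 x n2) block of W, flattened.
    R = (W.reshape(m1, m2, n1, n2)
          .transpose(0, 2, 1, 3)        # -> indices (i1, j1, i2, j2)
          .reshape(m1 * n1, m2 * n2))
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    A = np.sqrt(s[0]) * U[:, 0].reshape(m1, n1)
    B = np.sqrt(s[0]) * Vt[0, :].reshape(m2, n2)
    return A, B

# Toy check (hypothetical sizes): a 768x768 layer stored as a
# (24x24) kron (32x32) pair needs 24*24 + 32*32 parameters instead of 768*768.
W = np.random.randn(768, 768)
A, B = nearest_kronecker(W, (24, 24), (32, 32))
W_hat = np.kron(A, B)
print(W_hat.shape, np.linalg.norm(W - W_hat) / np.linalg.norm(W))

Unlike a truncated SVD, the Kronecker approximation W_hat = kron(A, B) can have full rank even with very few parameters, which is the property the citation statement above refers to.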
“…As mentioned in the introduction, fewer works in this category have been proposed compared to Knowledge Distillation on encoders. KnGPT2 [33] compresses the embedding and Transformer layers of GPT-2 using Kronecker decomposition. It uses KD to compensate for the performance drop of the compressed model.…”
Section: Knowledge Distillation on Transformer
confidence: 99%
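For readers unfamiliar with the KD step mentioned here, the following PyTorch-style sketch shows the generic soft-target distillation objective such pipelines typically use. The temperature, weighting, and variable names are assumptions; the actual KnGPT2 objective (layer mapping, loss terms, hyperparameters) may differ.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend hard-label cross-entropy with a soft-target KL term.

    Illustrative sketch only, not the paper's exact training objective.
    """
    # Soft targets from the full-size teacher, softened by the temperature.
    soft_teacher = F.log_softmax(teacher_logits / temperature, dim=-1)
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    kd = F.kl_div(soft_student, soft_teacher, log_target=True,
                  reduction="batchmean") * temperature ** 2
    # Standard language-modeling loss on the ground-truth tokens.
    ce = F.cross_entropy(student_logits.view(-1, student_logits.size(-1)),
                         labels.view(-1))
    return alpha * kd + (1 - alpha) * ce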
“…GPT's success can largely be attributed to its extensive pre-training on massive amounts of data and its large number of parameters (ranging from 100 million to billions). Although GPT achieves strong performance (particularly in few- and zero-shot setups), its overparameterized nature makes it difficult to deploy on systems with limited computing capabilities or storage [20].…”
Section: GPT
confidence: 99%