2021
DOI: 10.48550/arxiv.2106.04563
Preprint

XtremeDistilTransformers: Task Transfer for Task-agnostic Distillation

Abstract: While deep and large pre-trained models are the state-of-the-art for various natural language processing tasks, their huge size poses significant challenges for practical uses in resource-constrained settings. Recent works in knowledge distillation propose task-agnostic as well as task-specific methods to compress these models, with task-specific ones often yielding higher compression rate. In this work, we develop a new task-agnostic distillation framework XtremeDistilTransformers that leverages the advantage…
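
The abstract refers to compressing a large teacher model into a small student via knowledge distillation. As a point of reference only, the sketch below shows a generic soft-label distillation loss in PyTorch; it is not the paper's XtremeDistilTransformers transfer procedure, and the function name, temperature value, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Generic soft-label distillation: KL divergence between the
    temperature-softened teacher and student output distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

# Illustrative usage with random logits over a BERT-sized vocabulary.
teacher_logits = torch.randn(8, 30522)
student_logits = torch.randn(8, 30522)
print(distillation_loss(student_logits, teacher_logits).item())
```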

Cited by 6 publications (10 citation statements)
References 28 publications
“…Note, however, that the majority of the performance decrease due to the increasing depth is caused by only a single task: CoLA. This behaviour has previously been observed in the literature and is in line with other work trying to compress BERT behaviour into smaller models (Sun et al., 2019; Turc et al., 2019; Mukherjee et al., 2021). If we disregard CoLA, at least 98.6% of the predictive performance is preserved by all UltraFastBERT models.…”
Section: Results (supporting)
confidence: 90%
“…Regular expressions were still used in several cases to extract the key features. Then, different techniques were adopted to carry out the blocking itself, including sorted neighborhood, similarity joins [7,8], sentence encoding using BERT [10], a neural architecture based on a distilled transformer [15], and exploiting additional training data [16] to perform supervised contrastive learning [14]. These techniques were often followed by a pair/block cleaning and ranking step (based on intra-pair similarity) to comply with the submission structure.…”
Section: Solution Highlights (mentioning)
confidence: 99%
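
One of the techniques listed in this excerpt is blocking by sentence encoding with a distilled transformer. The sketch below is a minimal illustration of that idea, assuming the publicly released checkpoint microsoft/xtremedistil-l6-h256-uncased, mean pooling, and a 0.8 cosine-similarity threshold; none of these choices are taken from the cited solutions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

CHECKPOINT = "microsoft/xtremedistil-l6-h256-uncased"  # assumed hub id
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = AutoModel.from_pretrained(CHECKPOINT)

def embed(texts):
    """Encode records and mean-pool token states into one unit vector each."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state        # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (B, T, 1)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return torch.nn.functional.normalize(pooled, dim=-1)

# Blocking: keep only record pairs whose cosine similarity clears a threshold.
records_a = ["dell xps 13 laptop 8gb ram", "apple iphone 12 64gb black"]
records_b = ["xps 13 notebook by dell", "samsung galaxy s21 128gb"]
similarity = embed(records_a) @ embed(records_b).T
candidate_pairs = (similarity > 0.8).nonzero().tolist()
print(candidate_pairs)
```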
“…XDL has multiple variants available on Hugging Face, differing in encoder layers, hidden size, and attention heads. In this study, Xtremedistil-l6-h256-uncased has been used with the small parameter budget in mind, as it only has 12.7 million tunable parameters [52].…”
Section: XtremeDistil (mentioning)
confidence: 99%
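
For readers who want to check the figure quoted above, the short sketch below loads the checkpoint with the transformers library and counts its parameters. The hub id microsoft/xtremedistil-l6-h256-uncased is an assumption based on the model name in the excerpt; the ~12.7M figure comes from the citation statement itself.

```python
from transformers import AutoModel

MODEL_NAME = "microsoft/xtremedistil-l6-h256-uncased"  # assumed hub id
model = AutoModel.from_pretrained(MODEL_NAME)

# The citation statement reports roughly 12.7 million tunable parameters.
num_params = sum(p.numel() for p in model.parameters())
print(f"parameters: {num_params / 1e6:.1f}M")
```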