Findings of the Association for Computational Linguistics: EMNLP 2020
DOI: 10.18653/v1/2020.findings-emnlp.129

exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources

Abstract: We introduce exBERT, a training method to extend BERT pre-trained models from a general domain to a new pre-trained model for a specific domain with a new additive vocabulary under constrained training resources (i.e., constrained computation and data). exBERT uses a small extension module to learn to adapt an augmenting embedding for the new domain in the context of the original BERT's embedding of a general vocabulary. The exBERT training method is novel in learning the new vocabulary and the extension module…
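The abstract describes the mechanism only at a high level: a small, trainable extension module adapts an augmenting embedding for the new domain vocabulary while the original BERT embedding of the general vocabulary stays fixed. Purely as an illustration of that idea, the minimal PyTorch sketch below combines a frozen general-domain embedding with a trainable extension embedding through a learned gate. The module and parameter names (ExtensionEmbedding, gate, ext_vocab_size), the gated combination, and the fallback to token id 100 ([UNK] in bert-base-uncased) are assumptions for this sketch, not the paper's actual architecture.

```python
# Minimal sketch of the idea in the abstract, NOT the authors' exact architecture:
# a frozen general-domain embedding is combined with a small, trainable "extension"
# embedding for an additive domain vocabulary via a learned gate.
import torch
import torch.nn as nn


class ExtensionEmbedding(nn.Module):
    def __init__(self, base_embedding: nn.Embedding, ext_vocab_size: int):
        super().__init__()
        hidden = base_embedding.embedding_dim
        self.base = base_embedding                  # original BERT embedding, kept frozen
        for p in self.base.parameters():
            p.requires_grad = False
        # trainable embedding for the new, domain-specific tokens
        self.ext = nn.Embedding(ext_vocab_size, hidden)
        # small module that adapts the augmenting embedding in the context of the
        # general-domain embedding (here: a simple gated combination, an assumption)
        self.gate = nn.Sequential(nn.Linear(2 * hidden, 1), nn.Sigmoid())
        self.base_vocab_size = base_embedding.num_embeddings

    def forward(self, input_ids: torch.LongTensor) -> torch.Tensor:
        is_ext = input_ids >= self.base_vocab_size
        # extension ids fall back to [UNK] (id 100 in bert-base-uncased) for the base lookup
        base_ids = torch.where(is_ext, torch.full_like(input_ids, 100), input_ids)
        base_emb = self.base(base_ids)
        ext_ids = torch.where(is_ext, input_ids - self.base_vocab_size,
                              torch.zeros_like(input_ids))
        ext_emb = self.ext(ext_ids)
        w = self.gate(torch.cat([base_emb, ext_emb], dim=-1))
        # extension tokens get a gated mix of both embeddings; original tokens
        # keep the frozen general-domain embedding unchanged
        return torch.where(is_ext.unsqueeze(-1),
                           w * ext_emb + (1 - w) * base_emb,
                           base_emb)
```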

Cited by 60 publications (37 citation statements)
References 11 publications
“…While we compared different novel token sequence embedding techniques, we did not study different ways of identifying subtoken sequences to add. Comparing AT to approaches such as adding whole word tokens (Tai et al., 2020) would confirm our hypothesis that phrase-like token sequences are useful.…”
Section: Future Directions (mentioning)
confidence: 54%
“…As shown by previous work, this may lead to further improvements in performance on the clinical tasks. Another approach is to manually add specific tokens to the vocabulary of a pretrained model, as explored by Tai et al. (2020). An informed set of tokens could potentially be extracted by a new tokenizer specifically trained on the in-domain data and, in a later step, the set difference could be incorporated into the original tokenizer's vocabulary.…”
Section: Discussion (mentioning)
confidence: 99%
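The statement above suggests a concrete recipe: train a tokenizer on in-domain text, take the set difference against the original vocabulary, and add those tokens to the pre-trained model. A hedged sketch of that recipe with the Hugging Face transformers and tokenizers libraries follows; the corpus path corpus.txt and the 5,000-token target vocabulary size are placeholder assumptions, and tokens added this way are stored as "added tokens" rather than merged into the WordPiece vocabulary.

```python
# Sketch of the vocabulary-extension recipe described in the citation statement:
# train a fresh WordPiece tokenizer on in-domain text, compute the set difference
# against the original vocabulary, and extend the pre-trained model with it.
from transformers import AutoTokenizer, AutoModelForMaskedLM
from tokenizers import BertWordPieceTokenizer

base_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(base_name)
model = AutoModelForMaskedLM.from_pretrained(base_name)

# Train a new tokenizer on the domain corpus (file name and size are illustrative).
domain_tok = BertWordPieceTokenizer(lowercase=True)
domain_tok.train(files=["corpus.txt"], vocab_size=5000)

# New domain tokens = in-domain vocabulary minus the original vocabulary.
new_tokens = sorted(set(domain_tok.get_vocab()) - set(tokenizer.get_vocab()))

num_added = tokenizer.add_tokens(new_tokens)
model.resize_token_embeddings(len(tokenizer))   # grow the embedding matrix accordingly
print(f"added {num_added} domain-specific tokens")
```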
“…Moreover, SciBERT (Beltagy et al., 2019) found that an in-domain vocabulary is helpful but not significant, whereas we attribute this to the inefficiency of implicitly learning the in-domain vocabulary. To represent OOV words in multilingual settings, the mixture mapping method (Wang et al., 2019) utilized a mixture of English subword embeddings, but it has been shown to be ineffective for domain-specific words by Tai et al. (2020). exBERT (Tai et al., 2020) applied an extension module to adapt an augmenting embedding for the in-domain vocabulary, but it still requires large-scale continued pre-training.…”
Section: Related Work (mentioning)
confidence: 99%
“…To represent OOV words in multilingual settings, the mixture mapping method (Wang et al., 2019) utilized a mixture of English subword embeddings, but it has been shown to be ineffective for domain-specific words by Tai et al. (2020). exBERT (Tai et al., 2020) applied an extension module to adapt an augmenting embedding for the in-domain vocabulary, but it still requires large-scale continued pre-training. Similar to our work, they highlight the importance of domain-specific words, but none of these works explores the performance drop during a domain shift or examines the importance of multi-grained information.…”
Section: Related Work (mentioning)
confidence: 99%