exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources

Tai, Wen Kai; Kung, H. T.; Dong, Xin; Comiter, Marcus; Kuo, Chang‐Fu

doi:10.18653/v1/2020.findings-emnlp.129

Cited by 60 publications

(37 citation statements)

References 11 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…While we compared different novel token sequence embedding techniques, we did not study different ways of identifying subtoken sequences to add. Comparing AT to approaches such adding whole word tokens Tai et al (2020) would confirm our hypothesis that phrase-like token sequences are useful.…”

Section: Future Directionsmentioning

confidence: 54%

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Sachidananda¹,

Kessler²,

Lai³

2021

Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

View full text Add to dashboard Cite

Contextual embedding-based language models trained on large data sets, such as BERT and RoBERTa, provide strong performance across a wide range of tasks and are ubiquitous in modern NLP. It has been observed that fine-tuning these models on tasks involving data from domains different from that on which they were pretrained can lead to suboptimal performance. Recent work has explored approaches to adapt pretrained language models to new domains by incorporating additional pretraining using domain-specific corpora and task data. We propose an alternative approach for transferring pretrained language models to new domains by adapting their tokenizers. We show that domain-specific subword sequences can be efficiently determined directly from divergences in the conditional token distributions of the base and domain-specific corpora. In datasets from four disparate domains, we find adaptive tokenization on a pretrained RoBERTa model provides >97% of the performance benefits of domain specific pretraining. Our approach produces smaller models and less training and inference time than other approaches using tokenizer augmentation. While adaptive tokenization incurs a 6% increase in model parameters in our experimentation, due to the introduction of 10k new domain-specific tokens, our approach, using 64 vCPUs, is 72x faster than further pretraining the language model on domain-specific corpora on 8 TPUs.

show abstract

Section: Future Directionsmentioning

confidence: 54%

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Sachidananda¹,

Kessler²,

Lai³

2021

Proceedings of the Second Workshop on Simple and Efficient Natural Language Processing

View full text Add to dashboard Cite

show abstract

“…As shown by previous work, this may lead to further improvements in performance on the clinical tasks. Another approach is to manually add specific tokens to the vocabulary of a pretrained model, as explored by Tai et al (2020). An informed set of tokens could potentially be extracted by a new tokenizer specifically trained on the in-domain data, and in a later step, incorporate the set difference to the original tokenizer's vocabulary.…”

Section: Discussionmentioning

confidence: 99%

Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data

Lamproudis¹,

Henriksson²,

Dalianis³

2021

Proceedings of the Conference Recent Advances in Natural Language Processing - Deep Learning for Natural Language Processing Me

View full text Add to dashboard Cite

“…Moreover, SciBERT (Beltagy et al, 2019) found that in-domain vocabulary is helpful but not significant while we attribute it to the inefficiency of implicit learning of in-domain vocabulary. To represent OOV words in multilingual settings, the mixture mapping method (Wang et al, 2019) utilized a mixture of English subwords embedding, but it has been shown useless for domain-specific words by Tai et al (2020). ExBERT (Tai et al, 2020) applied an extension module to adapt an augmenting embedding for the in-domain vocabulary but it still needs large continuous pre-training.…”

Section: Related Workmentioning

confidence: 99%

“…To represent OOV words in multilingual settings, the mixture mapping method (Wang et al, 2019) utilized a mixture of English subwords embedding, but it has been shown useless for domain-specific words by Tai et al (2020). ExBERT (Tai et al, 2020) applied an extension module to adapt an augmenting embedding for the in-domain vocabulary but it still needs large continuous pre-training. Similar to our work, they highlight the importance of the domain-specific words but all of these work neither explore the understanding of performance drop during a domain shift nor examine the importance of multi-grained information.…”

Section: Related Workmentioning

confidence: 99%

Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

Diao¹,

Xu²,

Hongjin³

et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

View full text Add to dashboard Cite

Large pre-trained models such as BERT are known to improve different downstream NLP tasks, even when such a model is trained on a generic domain. Moreover, recent studies have shown that when large domain-specific corpora are available, continued pre-training on domain-specific data can further improve the performance of in-domain tasks. However, this practice requires significant domainspecific data and computational resources which may not always be available. In this paper, we aim to adapt a generic pretrained model with a relatively small amount of domain-specific data. We demonstrate that by explicitly incorporating the multi-granularity information of unseen and domain-specific words via the adaptation of (word based) ngrams, the performance of a generic pretrained model can be greatly improved. Specifically, we introduce a Transformer-based Domainaware N-gram Adaptor, T-DNA, to effectively learn and incorporate the semantic representation of different combinations of words in the new domain. Experimental results illustrate the effectiveness of T-DNA on eight lowresource downstream tasks from four domains. We show that T-DNA is able to achieve significant improvements compared to existing methods on most tasks using limited data with lower computational costs. Moreover, further analyses demonstrate the importance and effectiveness of both unseen words and the information of different granularities. 1

show abstract

exBERT: Extending Pre-trained Models with Domain-specific Vocabulary Under Constrained Training Resources

Cited by 60 publications

References 11 publications

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Efficient Domain Adaptation of Language Models via Adaptive Tokenization

Developing a Clinical Language Model for Swedish: Continued Pretraining of Generic BERT with In-Domain Data

Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation

Contact Info

Product

Resources

About