2020
DOI: 10.48550/arxiv.2003.02245
Preprint

Data Augmentation using Pre-trained Transformer Models

Abstract: Language model based pre-trained models such as BERT have provided significant gains across different NLP tasks. In this paper, we study different types of pre-trained transformer based models such as autoregressive models (GPT-2), auto-encoder models (BERT), and seq2seq models (BART) for conditional data augmentation. We show that prepending the class labels to text sequences provides a simple yet effective way to condition the pre-trained models for data augmentation. On three classification benchmarks, pre-t…
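As a rough illustration of the label-prepending idea described in the abstract, the sketch below formats each training example as a label-prefixed sequence and samples new examples from a GPT-2 checkpoint by prompting with the label prefix. It uses the Hugging Face transformers library; the separator string, the omitted fine-tuning step, and the decoding settings are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of label-conditioned data augmentation in the spirit of the
# abstract above: class labels are prepended to each training sentence, a
# pre-trained autoregressive LM (GPT-2 here) would be fine-tuned on those
# sequences, and new examples are then sampled by prompting with a label prefix.
# The separator token and decoding settings are assumptions for illustration.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

SEP = " <SEP> "  # assumed separator between label and text


def format_example(label: str, text: str) -> str:
    """Prepend the class label so the LM learns label-conditioned generation."""
    return f"{label}{SEP}{text}"


# Example of a label-prefixed training sequence; fine-tuning GPT-2 on such
# sequences is omitted here for brevity.
train_sequences = [format_example("positive", "a gripping and heartfelt film")]


def augment(label: str, num_samples: int = 3, max_new_tokens: int = 30):
    """Sample synthetic examples for `label` by prompting with the label prefix."""
    prompt = f"{label}{SEP}"
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,
        top_k=50,
        top_p=0.95,
        max_new_tokens=max_new_tokens,
        num_return_sequences=num_samples,
        pad_token_id=tokenizer.eos_token_id,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]


print(augment("positive"))
```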

Cited by 78 publications (89 citation statements)
References 18 publications
“…While data augmentation (DA) has been widely adopted in computer vision (Shorten & Khoshgoftaar, 2019), DA for language tasks is less straightforward. Recently, generative language models have been used to synthesize examples for various NLP tasks (Kumar et al., 2020; Anaby-Tavor et al., 2020; Puri et al., 2020; Yang et al., 2020). Different from these methods, which focus on low-resource language-only tasks, our method demonstrates the advantage of synthetic captions in large-scale vision-language pre-training.…”
Section: Data Augmentation
confidence: 99%
“…One technique for obtaining an abundance of examples uses recent Natural Language Generation (NLG) models (§7.1). It has been shown in recent papers (Wei and Zou, 2019; Anaby-Tavor et al., 2019; Kumar et al., 2020; Amin-Nejad et al., 2020; Russo et al., 2020) that generating an abundance of training examples can improve classifier performance. We aim to check whether this can improve our syntactic search method as well.…”
Section: arXiv:2102.05007v1 [cs.CL] 9 Feb 2021
confidence: 99%
“…For textual data, Zhang et al. (2015); Wei & Zou (2019) and Wang (2015) respectively use lexical substitution based on the embedding space. Jiao et al. (2019); Cheng et al. (2019); Kumar et al. (2020) generate augmented samples with a pre-trained language model. Some other techniques like back translation, random noise injection (Xie et al., 2017) and data mixup (Guo et al., 2019) are also proven to be useful.…”
Section: Related Work
confidence: 99%
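For concreteness, the sketch below shows one of the augmentation families listed in the statement above, embedding-space lexical substitution: a few words in a sentence are swapped for nearest neighbours in a pre-trained embedding space. The GloVe model name, substitution rate, and whitespace tokenisation are illustrative assumptions, not the cited papers' exact setups.

```python
# Minimal sketch of embedding-space lexical substitution for text augmentation:
# randomly chosen words are replaced by nearest neighbours in a pre-trained
# GloVe embedding space loaded through gensim's downloader.
import random

import gensim.downloader as api

vectors = api.load("glove-wiki-gigaword-100")  # downloads pre-trained GloVe vectors


def lexical_substitute(sentence: str, p: float = 0.3, topn: int = 5) -> str:
    tokens = sentence.split()
    augmented = []
    for tok in tokens:
        if tok.lower() in vectors and random.random() < p:
            # replace with a random nearest neighbour from the embedding space
            neighbour, _ = random.choice(vectors.most_similar(tok.lower(), topn=topn))
            augmented.append(neighbour)
        else:
            augmented.append(tok)
    return " ".join(augmented)


print(lexical_substitute("the movie was surprisingly good"))
```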
“…Various methods have been proposed to generate augmented samples for textual data. Recently, large-scale pre-trained language models like BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) learn contextualized representations and have been used widely in generating high-quality augmented sentences (Jiao et al., 2019; Kumar et al., 2020). In this paper, we use a pre-trained BERT trained with masked language modeling to generate augmented samples.…”
Section: Example: Mmel Implementation On Natural Language Understandi...
confidence: 99%
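The statement above describes masked-language-model augmentation with BERT; the sketch below shows the general recipe: tokens are masked one at a time and BERT proposes contextual replacements. The masking rate, model checkpoint, and top-1 replacement strategy are assumptions for illustration, not the cited paper's exact configuration.

```python
# Minimal sketch of masked-language-model augmentation: randomly mask tokens in
# a sentence and let a pre-trained BERT fill each mask with a contextual
# replacement, yielding a paraphrased variant of the input.
import random

from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")


def mlm_augment(sentence: str, p: float = 0.15) -> str:
    tokens = sentence.split()
    for i in range(len(tokens)):
        if random.random() < p:
            masked = tokens.copy()
            masked[i] = fill_mask.tokenizer.mask_token  # "[MASK]" for BERT
            # take BERT's top prediction for the masked position
            prediction = fill_mask(" ".join(masked))[0]
            tokens[i] = prediction["token_str"]
    return " ".join(tokens)


print(mlm_augment("the movie was surprisingly good"))
```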