A Survey of Data Augmentation Approaches for NLP

Feng, Steven Y.; Gangal, Varun; Wei, Jason; Chandar, Sarath; Vosoughi, Soroush; Mitamura, Teruko; Hovy, Eduard

doi:10.18653/v1/2021.findings-acl.84

Cited by 295 publications

(84 citation statements)

References 124 publications

“…Common image augmentation approaches include copying and warping an image, i.e via cropping and rotation [8]. NLP augmentation techniques may include copying a sentence and substituting words with synonyms to preserve meaning or translating a sentence into another language and back again [18–20]. Additionally synthetic data can be generated through a variety of techniques including Generative Adversarial Networks (GANs) and the Synthetic Minority Oversampling Technique (SMOTE) [8,21].…”

Section: Introduction (mentioning)

confidence: 99%

“…In the ML-related fields of computer vision [8,16,17] and natural language processing (NLP) [18–20], data augmentation is commonly applied to combat data limitations. Data augmentation refers to techniques that artificially increase the number of training examples, which can lead to improved performance and act as a regularizer in reducing overfitting.…”

Section: Introduction (mentioning)

confidence: 99%

Nucleotide augmentation for machine learning-guided protein engineering

Minot, Reddy. 2022. Preprint.

Machine learning-guided protein engineering is a rapidly advancing field. Despite major experimental and computational advances however, collecting protein genotype (sequence) and phenotype (function) data remains time and resource intensive. As a result, the quality and quantity of training data is often a limiting factor in developing machine learning models. Data augmentation techniques have been successfully applied to the fields of computer vision and natural language processing, however, there is a lack of such augmentation techniques for biological sequence data. Towards this end we develop nucleotide augmentation (NTA), which leverages natural nucleotide codon degeneracy to augment protein sequence data in a biologically meaningful way. As a proof of concept for protein engineering, we apply NTA to train machine learning models with benchmark data sets of protein genotype and phenotype, revealing performance gains on par and surpassing benchmarks models, even when only using a fraction of the training data. NTA also enables substantial improvements for classification tasks under heavy class imbalance. Availability and implementation: The code to use NTA and to reproduce the analyses in this study is publicly available at https://github.com/minotm/NTA
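
The codon-degeneracy idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' NTA implementation (their code is at the URL given in the abstract); the function names `nucleotide_augment` and `translate`, the `seed` parameter, and the uniform random choice among synonymous codons are assumptions made only for illustration. The codon table is the standard genetic code with stop codons omitted.

```python
import random

# Standard genetic code: amino acid -> synonymous DNA codons (stop codons omitted).
CODONS = {
    "A": ["GCT", "GCC", "GCA", "GCG"],
    "R": ["CGT", "CGC", "CGA", "CGG", "AGA", "AGG"],
    "N": ["AAT", "AAC"],
    "D": ["GAT", "GAC"],
    "C": ["TGT", "TGC"],
    "Q": ["CAA", "CAG"],
    "E": ["GAA", "GAG"],
    "G": ["GGT", "GGC", "GGA", "GGG"],
    "H": ["CAT", "CAC"],
    "I": ["ATT", "ATC", "ATA"],
    "L": ["TTA", "TTG", "CTT", "CTC", "CTA", "CTG"],
    "K": ["AAA", "AAG"],
    "M": ["ATG"],
    "F": ["TTT", "TTC"],
    "P": ["CCT", "CCC", "CCA", "CCG"],
    "S": ["TCT", "TCC", "TCA", "TCG", "AGT", "AGC"],
    "T": ["ACT", "ACC", "ACA", "ACG"],
    "W": ["TGG"],
    "Y": ["TAT", "TAC"],
    "V": ["GTT", "GTC", "GTA", "GTG"],
}

# Reverse lookup used to check that every variant encodes the same protein.
CODON_TO_AA = {codon: aa for aa, codons in CODONS.items() for codon in codons}


def nucleotide_augment(protein, n_variants, seed=0):
    """Hypothetical helper: sample distinct DNA encodings of one protein sequence."""
    rng = random.Random(seed)
    variants = set()
    # Cap the number of draws so short sequences with few encodings cannot loop forever.
    for _ in range(50 * n_variants):
        if len(variants) >= n_variants:
            break
        variants.add("".join(rng.choice(CODONS[aa]) for aa in protein))
    return sorted(variants)


def translate(dna):
    """Translate a DNA string (length divisible by 3) back to amino acids."""
    return "".join(CODON_TO_AA[dna[i:i + 3]] for i in range(0, len(dna), 3))


variants = nucleotide_augment("MKV", n_variants=4)
for dna in variants:
    print(dna, "->", translate(dna))
```

Each sampled nucleotide string differs at the sequence level but translates to the same protein, so the label (phenotype) of the original example can plausibly be reused for every variant.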

“…In other domains, researchers have developed various data augmentation techniques to overcome this bottleneck, enhancing the generalization of deep networks given limited data. For example, data augmentation has been used in computer vision [2], natural language processing [3], and semi-supervised learning [4].…”

Section: Introduction (mentioning)

confidence: 99%

Contrastive-mixup learning for improved speaker verification

Zhang, Jin, Cheng, et al. 2022. Preprint.

This paper proposes a novel formulation of prototypical loss with mixup for speaker verification. Mixup is a simple yet efficient data augmentation technique that fabricates a weighted combination of random data point and label pairs for deep neural network training. Mixup has attracted increasing attention due to its ability to improve robustness and generalization of deep neural networks. Although mixup has shown success in diverse domains, most applications have centered around closed-set classification tasks. In this work, we propose contrastive-mixup, a novel augmentation strategy that learns distinguishing representations based on a distance metric. During training, mixup operations generate convex interpolations of both inputs and virtual labels. Moreover, we have reformulated the prototypical loss function such that mixup is enabled on metric learning objectives. To demonstrate its generalization given limited training data, we conduct experiments by varying the number of available utterances from each speaker in the VoxCeleb database. Experimental results show that applying contrastive-mixup outperforms the existing baseline, reducing error rate by 16% relatively, especially when the number of training utterances per speaker is limited.
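
The "weighted combination of random data point and label pairs" described in the abstract above can be sketched as follows. This shows plain mixup only, not the contrastive-mixup or prototypical-loss formulation the paper proposes. The function name `mixup`, the `alpha` default, and drawing the mixing weight from a Beta(alpha, alpha) distribution are assumptions for illustration; the abstract does not specify them.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Hypothetical helper: convex interpolation of two (features, one-hot label) pairs."""
    # Mixing weight in [0, 1]; Beta(alpha, alpha) is an assumed, commonly used choice.
    lam = rng.betavariate(alpha, alpha)
    # Interpolate inputs and labels with the same weight.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


# Two toy examples from different classes, with one-hot labels.
rng = random.Random(0)
x_mix, y_mix, lam = mixup([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], rng=rng)
print(lam, x_mix, y_mix)
```

Because the same weight is applied to inputs and labels, the mixed label remains a valid probability vector, which is what lets the interpolated pair serve as a "virtual" training example.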

“…Large numbers of DA methods have been proposed recently, and a survey of existing methods is beneficial so that researchers could keep up with the speed of innovation. Liu et al [2] and Feng et al [3] both present surveys that give a bird's eye view of DA for NLP. They directly divide the categories according to the methods.…”

Section: Introduction (mentioning)

confidence: 99%

Data Augmentation Approaches in Natural Language Processing: A Survey

Li, Hou, Che. 2021. Preprint.

As an effective strategy, data augmentation (DA) alleviates data scarcity scenarios where deep learning techniques may fail. It is widely applied in computer vision then introduced to natural language processing and achieves improvements in many tasks. One of the main focuses of the DA methods is to improve the diversity of training data, thereby helping the model to better generalize to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data, including paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to the above categories. Further, we also introduce their applications in NLP tasks as well as the challenges.
