Do Not Have Enough Data? Deep Learning to the Rescue!

Anaby-Tavor, Ateret; Carmeli, Boaz; Goldbraich, Esther; Kantor, Amir; Kour, George; Shlomov, Segev; Tepper, Naama; Zwerdling, Naama

doi:10.1609/aaai.v34i05.6233

Cited by 201 publications

(116 citation statements)

References 20 publications

Supporting

Mentioning: 115

Contrasting


“…GPT [36], GPT-2 [22] models are capable of producing grammatically correct, high-quality texts even when fine-tuned on small training data [14]. Nevertheless, the lack of ability to preserve or protect certain words from the original text cannot be assured by this method either.…”

Section: Text Generation (mentioning)

confidence: 99%

“…Hence augmentation can improve the robustness and performance of the models. Recently, many studies have been published to tackle the problem of data augmentation in the NLP field [14][15][16]. Some approaches depend more on the language or language models [14,17], while others are (almost) independent [15,18].…”

Section: Introduction (mentioning)

confidence: 99%

“…Recently, many studies have been published to tackle the problem of data augmentation in the NLP field [14][15][16]. Some approaches depend more on the language or language models [14,17], while others are (almost) independent [15,18]. However, when applying text augmentation, one must pay attention to the characteristics of the text and the problem to be solved, since both of these may affect what type of augmentation techniques can be applied.…”

Section: Introduction (mentioning)

confidence: 99%


Comparison of data augmentation methods for legal document classification

Csányi, Orosz (2021). Acta Tech. Jaurinensis


Sorting out the legal documents by their subject matter is an essential and time-consuming task due to the large amount of data. Many machine learning-based text categorization methods exist, which can resolve this problem. However, these algorithms can not perform well if they do not have enough training data for every category. Text augmentation can resolve this problem. Data augmentation is a widely used technique in machine learning applications, especially in computer vision. Textual data has different characteristics than images, so different solutions must be applied when the need for data augmentation arises. However, the type and different characteristics of the textual data or the task itself may reduce the number of methods that could be applied in a certain scenario. This paper focuses on text augmentation methods that could be applied to legal documents when classifying them into specific groups of subject matters.


“…Some approaches in this group replicate samples through word replacements based on embeddings of the word and its surrounding context [23,16,26]. Other group of approaches have explored translation and back-translation [20,22], auto-regressive language models [1], and auto-encoders [13].…”

Section: Introduction (mentioning)

confidence: 99%

Exploring Conditional Language Model Based Data Augmentation Approaches for Hate Speech Classification

D'Sa, Illina, Fohr, et al. (2021). Text, Speech, and Dialogue


Deep Neural Network (DNN) based classifiers have gained increased attention in hate speech classification. However, the performance of DNN classifiers increases with quantity of available training data and in reality, hate speech datasets consist of only a small amount of labeled data. To counter this, Data Augmentation (DA) techniques are often used to increase the number of labeled samples and therefore, improve the classifier's performance. In this article, we explore augmentation of training samples using a conditional language model. Our approach uses a single class conditioned Generative Pre-Trained Transformer-2 (GPT-2) language model for DA, avoiding the need for multiple class specific GPT-2 models. We study the effect of increasing the quantity of the augmented data and show that adding a few hundred samples significantly improves the classifier's performance. Furthermore, we evaluate the effect of filtering the generated data used for DA. Our approach demonstrates up to 7.3% and up to 25.0% of relative improvements in macro-averaged F1 on two widely used hate speech corpora.
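The abstract above describes two steps that a short sketch can make concrete: conditioning a single language model on the class label, and filtering the generated samples before adding them to the training set. The Python below is only an illustration under stated assumptions. The separator token `<|sep|>`, the helper names, the confidence threshold of 0.8, and the keyword-based stand-in classifier are all hypothetical and do not come from the cited paper. No real GPT-2 model is loaded; the candidate texts are hard-coded where model output would normally appear.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical separator token; the cited abstract does not specify a prompt format.
SEP = "<|sep|>"


def to_conditioned_text(label: str, text: str) -> str:
    """Prefix a training sample with its class label so one LM can serve all classes."""
    return f"{label} {SEP} {text}"


def make_prompt(label: str) -> str:
    """Build the prompt that would ask a fine-tuned LM for a new sample of `label`."""
    return f"{label} {SEP}"


def filter_generated(
    candidates: List[Tuple[str, str]],
    predict_proba: Callable[[str], Dict[str, float]],
    threshold: float = 0.8,  # assumed value, chosen only for illustration
) -> List[Tuple[str, str]]:
    """Keep (label, text) pairs whose intended label scores at least `threshold`."""
    kept = []
    for label, text in candidates:
        if predict_proba(text).get(label, 0.0) >= threshold:
            kept.append((label, text))
    return kept


def toy_predict_proba(text: str) -> Dict[str, float]:
    """Stand-in classifier (keyword rule); a real setup would use a trained model."""
    p_offensive = 0.9 if "awful" in text.lower() else 0.1
    return {"offensive": p_offensive, "normal": 1.0 - p_offensive}


# Lines in this format would be used to fine-tune the single conditional LM.
train_lines = [to_conditioned_text("normal", "Have a nice day")]

# Hard-coded stand-ins for text sampled from the LM after prompting with a label.
candidates = [
    ("offensive", "You are awful"),
    ("offensive", "Nice weather today"),  # label and content disagree
    ("normal", "See you soon"),
]
kept = filter_generated(candidates, toy_predict_proba)
print(kept)
```

In this toy run the second candidate is dropped because the stand-in classifier does not agree with its intended label. How strict the filter should be is a design choice the abstract leaves open.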


“…Reconstructing the training data of the unobserved variables is quite difficult without the aid of a suitable model. Since the data in climate science, geophysics, and many other complex nonlinear systems are spatiotemporally correlated and intrinsically chaotic, the traditional data augmentation methods [4][5][6] for static data are not applicable to expanding the training data set of these problems. On the other hand, thanks to the development of many physics-based dynamical models in describing nature, long correlated time series from these models have been used for training the machine learning algorithms [7,8].…”

Section: Introduction (mentioning)

confidence: 99%

Can Short and Partial Observations Reduce Model Error and Facilitate Machine Learning Prediction?

Chen (2020). Entropy


Predicting complex nonlinear turbulent dynamical systems is an important and practical topic. However, due to the lack of a complete understanding of nature, the ubiquitous model error may greatly affect the prediction performance. Machine learning algorithms can overcome the model error, but they are often impeded by inadequate and partial observations in predicting nature. In this article, an efficient and dynamically consistent conditional sampling algorithm is developed, which incorporates the conditional path-wise temporal dependence into a two-step forward-backward data assimilation procedure to sample multiple distinct nonlinear time series conditioned on short and partial observations using an imperfect model. The resulting sampled trajectories succeed in reducing the model error and greatly enrich the training data set for machine learning forecasts. For a rich class of nonlinear and non-Gaussian systems, the conditional sampling is carried out by solving a simple stochastic differential equation, which is computationally efficient and accurate. The sampling algorithm is applied to create massive training data of multiscale compressible shallow water flows from highly nonlinear and indirect observations. The resulting machine learning prediction significantly outweighs the imperfect model forecast. The sampling algorithm also facilitates the machine learning forecast of a highly non-Gaussian climate phenomenon using extremely short observations.
