Even though deep neural models have achieved superhuman performance on many popular benchmarks, they fail to generalize to out-of-distribution (OOD) or adversarial datasets. Conventional approaches to increasing robustness include developing ever-larger models and augmenting training with large-scale datasets. Orthogonal to these trends, we hypothesize that a smaller, high-quality dataset is what we actually need. Our hypothesis rests on the fact that deep neural networks are data-driven models: data is what leads, or misleads, a model. In this work, we propose an empirical study that examines how to select a subset of, and/or create, high-quality benchmark data from which a model can learn effectively. We seek to answer whether big datasets are truly needed to learn a task, and whether a smaller subset of high-quality data can replace them. We plan to investigate both data pruning and data creation paradigms for generating high-quality datasets.
Introduction

Deep neural models such as EfficientNet-B7 [40], BERT [6], and RoBERTa [26] have achieved super-human performance on many popular benchmarks across various domains, such as ImageNet [37], SNLI [3], and SQuAD [36]. However, their performance drops drastically on exposure to out-of-distribution (OOD) and adversarial datasets [14, 8, 15, 13]. Substantial resources and time are being invested in developing better models and architectures, such as transformer-based approaches [45], that dominate leaderboards. Since deep learning, a data-driven approach, derives its representations from data, shouldn't the focus be placed on creating 'better' datasets rather than on developing increasingly complex models?

Let us consider this through an analogy: a student (A) is asked to self-learn a concept by going through a question bank (Q1) containing 1000 solved questions. After self-learning, A is tested on 100 unsolved questions at the end of Q1. While A achieves unprecedented performance (85/100), beating other students who were explicitly taught the concept, when tested on another 100 questions on the same topic from a second question bank (Q2), A fails on 50 questions. Similarly, when A is interviewed by a teacher, A fails to answer 70 questions. On analysis, we see that A has not truly learned the concept in Q1; instead, A solves questions by relying on common question patterns seen in Q1 and associating them with the provided answers. To fix this, suppose A is provided 1000 solved questions from Q2. On testing, we find that A now correctly answers 90/100 unsolved questions from Q2, but only 55/100 from Q1, and 35/100 in the interview. Now, we provide A with 100 question banks in a similar manner, and find that A's performance on both Q1 and Q2 is 70/100, while the interview score is 40/100. To improve interview performance, suppose the interviewer prepares an additional question bank Qi; if A self-learns using both Q1 and Qi, the scores for Q1, Q2, and the interview become 80/100, 45/100, and 80/100, respectively.
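To make the data pruning paradigm mentioned above concrete, the sketch below shows one generic way a high-quality subset could be selected: score each training example with some per-example quality signal and keep only the top-ranked fraction. This is a minimal illustration under our own assumptions; the scoring function, the keep ratio, and every name in the snippet (prune_dataset, keep_fraction) are hypothetical and are not the procedure proposed in this work.

import numpy as np

def prune_dataset(examples, scores, keep_fraction=0.2):
    """Return the subset of `examples` with the highest quality `scores`.

    examples:      list of training examples (any type)
    scores:        per-example quality scores, higher = better
                   (e.g. annotator agreement or a reference model's
                   confidence -- the choice of signal is an assumption here)
    keep_fraction: fraction of the dataset to retain (hypothetical default)
    """
    scores = np.asarray(scores)
    n_keep = max(1, int(len(examples) * keep_fraction))
    top_idx = np.argsort(-scores)[:n_keep]  # indices of best-scoring examples
    return [examples[i] for i in top_idx]

# Toy usage: keep the 2 "highest quality" of 5 examples.
toy_examples = ["ex0", "ex1", "ex2", "ex3", "ex4"]
toy_scores = [0.1, 0.9, 0.4, 0.8, 0.2]
print(prune_dataset(toy_examples, toy_scores, keep_fraction=0.4))
# -> ['ex1', 'ex3']

Any concrete instantiation would hinge on the choice of quality score, which is precisely the kind of design question the proposed study is meant to examine empirically.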