Proceedings of SustaiNLP: Workshop on Simple and Efficient Natural Language Processing 2020
DOI: 10.18653/v1/2020.sustainlp-1.23

Do We Need to Create Big Datasets to Learn a Task?

Abstract: Deep Learning research has been largely accelerated by the development of huge datasets such as ImageNet. The general trend has been to create big datasets to make a deep neural network learn. A huge amount of resources is being spent in creating these big datasets, developing models, training them, and iterating this process to dominate leaderboards. We argue that the trend of creating bigger datasets needs to be revised by better leveraging the power of pre-trained language models. Since the language models …

Cited by 7 publications (5 citation statements). References 17 publications.
“…One successful strategy is to filter the training data, for example by removing duplicates (Lee et al., 2022), or excluding thematic document clusters that lead to undesirable model behavior (Kaddour, 2023). Mishra and Sachdeva (2020) used human-inspired heuristics to remove irrelevant and redundant data, aiming to select the optimal dataset for learning a specific task. Via a combination of coarse and fine pruning techniques, they achieved competitive results on out-of-distribution NLI datasets with only ∼2% of the SNLI training set.…”
Section: Data-efficient NLP (mentioning; confidence: 99%)
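The duplicate-removal strategy mentioned in this citation statement can be illustrated with a minimal sketch. This is not the large-scale near-duplicate pipeline of Lee et al. (2022); it is a toy exact-match filter over hypothetical SNLI-style records, where the premise/hypothesis field names are assumed for illustration.

    # Toy sketch only: exact-duplicate filtering over SNLI-style records.
    # Field names "premise"/"hypothesis" are assumed; Lee et al. (2022)
    # use large-scale near-duplicate detection, which this does not reproduce.
    import hashlib

    def dedup_examples(examples):
        """Keep the first occurrence of each distinct premise/hypothesis pair."""
        seen, kept = set(), []
        for ex in examples:
            key = hashlib.sha1(
                (ex["premise"] + "\t" + ex["hypothesis"]).encode("utf-8")
            ).hexdigest()
            if key not in seen:
                seen.add(key)
                kept.append(ex)
        return kept

    train = [
        {"premise": "A man plays guitar.", "hypothesis": "Someone makes music.", "label": "entailment"},
        {"premise": "A man plays guitar.", "hypothesis": "Someone makes music.", "label": "entailment"},
    ]
    print(len(dedup_examples(train)))  # 1 example kept after deduplication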
“…Do we really need big datasets? [29] Motivated by the process of human learning, which relies on deep background knowledge about the world (we don't need access to hundreds of online materials to learn a topic; rather, we intentionally avoid many noisy, distracting, and irrelevant materials), we probe this question. Considering that pre-training on large datasets has imparted linguistic knowledge to models like BERT [6] and RoBERTa [26], we realize that models no longer need to learn from scratch; instead, learning task-specific terminology (such as 'Entailment'/'Neutral'/'Contradiction' labels for Natural Language Inference) suffices, and might not necessitate the use of large datasets.…”
Section: Dataset Pruning (mentioning; confidence: 99%)
“…In our preliminary experiments [29] (Table 1), we utilize the first term of DQI C1 (component 1) to prune SNLI [3] to ∼1-2% of its original size (550K). When RoBERTa is trained with our pruned dataset, it achieves near-equal performance on the SNLI dev set, as well as competitive zero-shot generalization on: (i) NLI Diagnostics [46], (ii) Stress Tests [32], and (iii) Adversarial NLI [33].…”
Section: Exploratory Analysis (mentioning; confidence: 99%)
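As a rough illustration of the keep-a-small-fraction pruning described in this statement, the sketch below ranks examples by a quality score and retains the top ~2%. The actual DQI component-1 term from the cited work is not reproduced; quality_score is a hypothetical stand-in.

    # Hypothetical sketch of score-based dataset pruning. The real DQI C1 term
    # is NOT implemented here; quality_score is a placeholder supplied by the caller.
    def prune_by_score(examples, quality_score, keep_fraction=0.02):
        """Return the highest-scoring keep_fraction of examples."""
        ranked = sorted(examples, key=quality_score, reverse=True)
        n_keep = max(1, int(len(ranked) * keep_fraction))
        return ranked[:n_keep]

    # Toy usage with a crude length-based heuristic as the stand-in score.
    toy = [{"premise": "p", "hypothesis": "h" * (i + 1)} for i in range(100)]
    pruned = prune_by_score(toy, lambda ex: len(ex["hypothesis"]), keep_fraction=0.02)
    print(len(pruned))  # 2 of 100 examples retained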
“…Sheng et al. (2008) studies the tradeoff between collecting multiple labels per example vs. annotating more examples. Researchers have also explored different data labeling strategies, such as active learning (Fang et al., 2017), providing fine-grained rationales (Dua et al., 2020), retrospectively studying the amount of training data necessary for generalization (Mishra and Sachdeva, 2020), and the policy learning approach (Kratzwald et al., 2020).…”
Section: Related Work (mentioning; confidence: 99%)