Even though deep neural models have achieved superhuman performance on many popular benchmarks, they fail to generalize to out-of-distribution (OOD) or adversarial datasets. Conventional approaches to increasing robustness include developing ever-larger models and augmenting training with large-scale datasets. Orthogonal to these trends, we hypothesize that a smaller, high-quality dataset is what we actually need. Our hypothesis rests on the fact that deep neural networks are data-driven models: data is what leads, or misleads, a model. In this work, we propose an empirical study that examines how to select a subset of, and/or create, high-quality benchmark data from which a model can learn effectively. We seek to answer whether big datasets are truly needed to learn a task, and whether a smaller subset of high-quality data can replace them. We plan to investigate both data pruning and data creation paradigms for generating high-quality datasets.
Introduction

Deep neural models such as EfficientNet-B7 [40], BERT [6], and RoBERTa [26] have achieved super-human performance on many popular benchmarks across various domains, such as ImageNet [37], SNLI [3], and SQuAD [36]. However, their performance drops drastically on exposure to out-of-distribution (OOD) and adversarial datasets [14, 8, 15, 13]. Substantial resources and time are being invested in developing better models and architectures, such as transformer-based approaches [45], that dominate leaderboards. Since deep learning, a data-driven approach, derives its representations from data, shouldn't the focus be placed on creating 'better' datasets rather than on developing increasingly complex models?

Let us consider this through an analogy: a student (A) is asked to self-learn a concept by going through a question bank (Q1) containing 1000 solved questions. After self-learning, A is tested on 100 unsolved questions at the end of Q1. While A achieves unprecedented performance (85/100), beating other students who were explicitly taught the concept, when tested on another 100 questions on the same topic from a second question bank (Q2), A fails on 50 questions. Similarly, when A is interviewed by a teacher, A fails to answer 70 questions. On analysis, we see that A has not truly learned the concept in Q1; instead, A solves questions by relying on common question patterns seen in Q1 and associating them with the provided answers. To fix this, suppose A is provided 1000 solved questions from Q2. On testing, we find that A now correctly answers 90/100 unsolved questions from Q2, but only 55/100 from Q1, and 35/100 in the interview. Now, we provide A with 100 question banks in a similar manner, and find that A's performance on both Q1 and Q2 is 70/100, while the interview score is 40/100. To improve interview performance, suppose the interviewer prepares an additional question bank Qi; if A self-learns using both Q1 and Qi, the scores for Q1, Q2, and the interview become 80/100, 45/100, and 80/100, respectively.
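To make the data pruning paradigm mentioned above concrete, the sketch below shows one generic way a high-quality subset could be selected: score each training example with some per-example quality signal and keep only the top-ranked fraction. This is a minimal illustration under our own assumptions; the scoring function, the keep ratio, and every name in the snippet (prune_dataset, keep_fraction) are hypothetical and are not the procedure proposed in this work.

import numpy as np

def prune_dataset(examples, scores, keep_fraction=0.2):
    """Return the subset of `examples` with the highest quality `scores`.

    examples:      list of training examples (any type)
    scores:        per-example quality scores, higher = better
                   (e.g. annotator agreement or a reference model's
                   confidence -- the choice of signal is an assumption here)
    keep_fraction: fraction of the dataset to retain (hypothetical default)
    """
    scores = np.asarray(scores)
    n_keep = max(1, int(len(examples) * keep_fraction))
    top_idx = np.argsort(-scores)[:n_keep]  # indices of best-scoring examples
    return [examples[i] for i in top_idx]

# Toy usage: keep the 2 "highest quality" of 5 examples.
toy_examples = ["ex0", "ex1", "ex2", "ex3", "ex4"]
toy_scores = [0.1, 0.9, 0.4, 0.8, 0.2]
print(prune_dataset(toy_examples, toy_scores, keep_fraction=0.4))
# -> ['ex1', 'ex3']

Any concrete instantiation would hinge on the choice of quality score, which is precisely the kind of design question the proposed study is meant to examine empirically.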