Progress in Natural Language Processing (NLP) has been dictated by the rule of more: more data, more computing power, and more complexity, best exemplified by deep learning Transformers. However, training (or fine-tuning) large dense models for specific applications usually requires significant computing resources. One way to ameliorate this problem is through data engineering (DE), rather than from an algorithmic or hardware perspective. Our focus here is an under-investigated DE technique with enormous potential in the current scenario:
Instance Selection (IS), also known as Selective Sampling or Prototype Selection. The goal of IS is to reduce the training set size by removing noisy or redundant instances while maintaining or improving the effectiveness (accuracy) of the trained models and reducing the cost of the training process. We survey classical and recent state-of-the-art IS techniques and provide a scientifically sound comparison of IS methods applied to an essential NLP task, Automatic Text Classification (ATC). IS methods have typically been applied to small tabular datasets and have not been systematically compared in ATC. We consider several neural and non-neural state-of-the-art ATC solutions and many datasets. We answer several research questions based on the tradeoffs induced by a tripod of criteria (Effectiveness, Efficiency, Training-Set Reduction). Our answers reveal an enormous unfulfilled potential for IS solutions. Specifically, we show that in 12 out of 19 datasets, specific IS methods, namely Condensed Nearest Neighbor (CNN), Local Set-based Smoother (LSSm), and Local Set Border Selector (LSBo), can reduce the size of the training set without effectiveness losses. Furthermore, when fine-tuning transformer models, the IS methods reduce the amount of data needed without losing effectiveness and with considerable training-time gains.
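
To make the IS goal concrete, the following is a minimal sketch of Condensed Nearest Neighbor in its classical form (Hart's rule): an instance is kept only if the instances already kept would misclassify it under 1-NN. It assumes dense numeric feature vectors and Euclidean distance; the function name `condensed_nearest_neighbor`, the `seed` parameter, and the toy two-cluster data are illustrative assumptions, not the implementation or the datasets evaluated in this work. For text, the rows of `X` would be document representations such as TF-IDF vectors or embeddings.

```python
import numpy as np


def condensed_nearest_neighbor(X, y, seed=0):
    """Return sorted indices of a reduced training set (hypothetical helper).

    After the final pass, 1-NN over the kept instances labels every
    training instance correctly, assuming no duplicate points with
    conflicting labels.
    """
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))   # random visiting order
    keep = [int(order[0])]            # start the store with one instance
    kept = {keep[0]}                  # set mirror of `keep` for fast lookup
    changed = True
    while changed:                    # repeat passes until nothing is added
        changed = False
        for i in map(int, order):
            if i in kept:
                continue
            # 1-NN prediction for instance i using only the kept instances
            dists = np.linalg.norm(X[keep] - X[i], axis=1)
            nearest = keep[int(np.argmin(dists))]
            if y[nearest] != y[i]:    # misclassified, so it carries information
                keep.append(i)
                kept.add(i)
                changed = True
    return np.array(sorted(keep))


# Toy usage: two well-separated 2-D clusters (synthetic data, assumption).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)), rng.normal(5.0, 0.5, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
selected = condensed_nearest_neighbor(X, y)
print(f"kept {len(selected)} of {len(X)} training instances")
```

A downstream classifier would then be trained on `X[selected]` and `y[selected]` only. Redundant instances deep inside a class region are never added, which is where the training-set reduction comes from; the result depends on the visiting order, so different seeds may select different subsets.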