Utterance classification is a key component in many conversational systems. However, classifying real-world user utterances is challenging, as people may express their ideas and thoughts in manifold ways, and the amount of training data for some categories may be fairly limited, resulting in imbalanced data distributions. To alleviate these issues, we conduct a comprehensive survey regarding data augmentation approaches for text classification, including simple random resampling, word-level transformations, and neural text generation to cope with imbalanced data. Our experiments focus on multi-class datasets with a large number of data samples, which has not been systematically studied in previous work. The results show that the effectiveness of different data augmentation schemes depends on the nature of the dataset under consideration.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.