Data augmentation is an important component in the robustness evaluation of models in natural language processing (NLP) and in enhancing the diversity of the data they are trained on. In this paper, we present NL-Augmenter, a new participatory Pythonbased natural language augmentation framework which supports the creation of both transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of natural language tasks. We demonstrate the efficacy of NL-Augmenter by using several of its tranformations to analyze the robustness of popular natural language models. The infrastructure, datacards and robutstness analysis results are available publicly on the NL-Augmenter repository (https://github. com/GEM-benchmark/NL-Augmenter).
No abstract
Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP. El aumento de datos es un método importante para evaluar la solidez y mejorar la diversidad del entrenamiento datos para modelos de procesamiento de lenguaje natural (NLP). इस लेख में, हम एनएल-ऑगमेंटर का प्रस्ताव करते हैं - एक नया भागी- दारी पूर्वक, पायथन में बनाया गया, लैंग्वेज (एनएल) ऑग्मेंटेशन फ्रेमवर्क जो ट्रांसफॉर्मेशन (डेटा में बदलाव करना) और फीलटर (फीचर्स के अनुसार डेटा का भाग करना) के नीरमान का समर्थन करता है।. 我们描述了NL-Augmenter框架及其初步包含的117种转换和23个过滤器,并 大致标注分类了一系列可适配的自然语言任务. این دگرگونی ها شامل نویز، اشتباهات عمدی و تصادفی انسانی، تنوع اجتماعی-زبانی، سبک معنایی معتبر، تغییرات نحوی و همچنین ساختارهای مصنوعی است که برای انسان ها مبهم است. NL-Augmenterpa allin kaynintam qawachiyku, tikrakuyninku- nata servichikuspayku, chaywanmi qawariyku modelos de lenguaje popular nisqapa allin takyasqa kayninta. Kami menemukan model yang berbeda ditantang secara berbeda pada tugas yang berbeda, dengan penurunan skor kuasi-sistematis. Infrastruktur, kartu data, dan hasil evaluasi ketahanan dipublikasikan tersedia secara gratis di GitHub untuk kepentingan para peneliti yang mengerjakan pembuatan parafrase, analisis ketahanan, dan NLP sumber daya rendah.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.