Can vectors read minds better than experts? Comparing data augmentation strategies for the automated scoring of children’s mindreading ability

Kovatchev, Venelin; Smith, Phillip; Lee, Mark; Devine, Rory T.

doi:10.18653/v1/2021.acl-long.96

Cited by 8 publications

(8 citation statements)

References 24 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…This augmentation adds noise in the form of colloquial filler phrases. 23 di erent phrases are chosen across 3 di erent categories: general filler words and phrases ("uhm", "err", "actually", "like", "you know"...), phrases emphasizing speaker opinion/mental state ("I think/believe/mean", "I would say"...) & phrases indicating uncertainty ("maybe", "perhaps", "probably", "possibly", "most likely").The la er two categories had shown promising results Kovatchev et al (2021) when they were concatenated at the beginning of the sentence unlike this implementation which perform insertions at any random positions. Filler words are based on the work of Laserna et al (2014) but have not been explored in the context of data augmentation.…”

Section: A40 Filler Word Augmentationmentioning

confidence: 99%

NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

Dhole

Gangal²,

Gehrmann³

et al. 2023

NEJLT

View full text Add to dashboard Cite

Data augmentation is an important method for evaluating the robustness of and enhancing the diversity of training data for natural language processing (NLP) models. In this paper, we present NL-Augmenter, a new participatory Python-based natural language (NL) augmentation framework which supports the creation of transformations (modifications to the data) and filters (data splits according to specific features). We describe the framework and an initial set of 117 transformations and 23 filters for a variety of NL tasks annotated with noisy descriptive tags. The transformations incorporate noise, intentional and accidental human mistakes, socio-linguistic variation, semantically-valid style, syntax changes, as well as artificial constructs that are unambiguous to humans. We demonstrate the efficacy of NL-Augmenter by using its transformations to analyze the robustness of popular language models. We find different models to be differently challenged on different tasks, with quasi-systematic score decreases. The infrastructure, datacards, and robustness evaluation results are publicly available on GitHub for the benefit of researchers working on paraphrase generation, robustness analysis, and low-resource NLP. El aumento de datos es un método importante para evaluar la solidez y mejorar la diversidad del entrenamiento datos para modelos de procesamiento de lenguaje natural (NLP). इस लेख में, हम एनएल-ऑगमेंटर का प्रस्ताव करते हैं - एक नया भागी- दारी पूर्वक, पायथन में बनाया गया, लैंग्वेज (एनएल) ऑग्मेंटेशन फ्रेमवर्क जो ट्रांसफॉर्मेशन (डेटा में बदलाव करना) और फीलटर (फीचर्स के अनुसार डेटा का भाग करना) के नीरमान का समर्थन करता है।. 我们描述了NL-Augmenter框架及其初步包含的117种转换和23个过滤器，并大致标注分类了一系列可适配的自然语言任务. این دگرگونی ها شامل نویز، اشتباهات عمدی و تصادفی انسانی، تنوع اجتماعی-زبانی، سبک معنایی معتبر، تغییرات نحوی و همچنین ساختارهای مصنوعی است که برای انسان ها مبهم است. NL-Augmenterpa allin kaynintam qawachiyku, tikrakuyninku- nata servichikuspayku, chaywanmi qawariyku modelos de lenguaje popular nisqapa allin takyasqa kayninta. Kami menemukan model yang berbeda ditantang secara berbeda pada tugas yang berbeda, dengan penurunan skor kuasi-sistematis. Infrastruktur, kartu data, dan hasil evaluasi ketahanan dipublikasikan tersedia secara gratis di GitHub untuk kepentingan para peneliti yang mengerjakan pembuatan parafrase, analisis ketahanan, dan NLP sumber daya rendah.

show abstract

Section: A40 Filler Word Augmentationmentioning

confidence: 99%

NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

Dhole

Gangal²,

Gehrmann³

et al. 2023

NEJLT

View full text Add to dashboard Cite

show abstract

“…Data Generation Recent years have seen an increasing interest in the use of data generation and data augmentation for various NLP tasks (Liu, Swayamdipta, Smith and Choi, 2022;Hartvigsen, Gabriel, Palangi, Sap, Ray and Kamar, 2022;Dhole, Gangal, Gehrmann, Gupta, Li, Mahamood, Mahendiran, Mille, Srivastava, Tan et al, 2021;Kovatchev, Smith, Lee and Devine, 2021). The use of synthetic data has not been extensively explored in the context of factchecking.…”

Section: Challengesmentioning

confidence: 99%

The State of Human-centered NLP Technology for Fact-checking

Das,

Liu,

Kovatchev

et al. 2023

Preprint

View full text Add to dashboard Cite

Misinformation threatens modern society by promoting distrust in science, changing narratives in public health, heightening social polarization, and disrupting democratic elections and financial markets, among a myriad of other societal harms. To address this, a growing cadre of professional fact-checkers and journalists provide high-quality investigations into purported facts. However, these largely manual efforts have struggled to match the enormous scale of the problem. In response, a growing body of Natural Language Processing (NLP) technologies have been proposed for more scalable fact-checking. Despite tremendous growth in such research, however, practical adoption of NLP technologies for fact-checking still remains in its infancy today.In this work, we review the capabilities and limitations of the current NLP technologies for fact-checking. Our particular focus is to further chart the design space for how these technologies can be harnessed and refined in order to better meet the needs of human fact-checkers. To do so, we review key aspects of NLP-based fact-checking: task formulation, dataset construction, modeling, and human-centered strategies, such as explainable models and human-in-the-loop approaches. Next, we review the efficacy of applying NLP-based fact-checking tools to assist human fact-checkers. We recommend that future research include collaboration with fact-checker stakeholders early on in NLP research, as well as incorporation of human-centered design practices in model development, in order to further guide technology development for human use and practical adoption. Finally, we advocate for more research on benchmark development supporting extrinsic evaluation of human-centered fact-checking technologies.

show abstract

“…The in-domain dictionary contains all keywords identified by Chapparo et al and maps each keyword to its substitutes, e.g., affect → {break, block, close, ...}. Domain knowledge guided operators have been recently shown to lead to better performance compared to more advanced but general approaches (e.g., embeddings) [25].…”

Section: Natural Language Da Operatorsmentioning

confidence: 99%

Fast changeset-based bug localization with BERT

Ciborowska

Damevski

2022

Proceedings of the 44th International Conference on Software Engineering

View full text Add to dashboard Cite

Modern Deep Learning (DL) architectures based on transformers (e.g., BERT, RoBERTa) are exhibiting performance improvements across a number of natural language tasks. While such DL models have shown tremendous potential for use in software engineering applications, they are often hampered by insufficient training data. Particularly constrained are applications that require projectspecific data, such as bug localization, which aims at recommending code to fix a newly submitted bug report. Deep learning models for bug localization require a substantial training set of fixed bug reports, which are at a limited quantity even in popular and actively developed software projects. In this paper, we examine the effect of using synthetic training data on transformer-based DL models that perform a more complex variant of bug localization, which has the goal of retrieving bug-inducing changesets for each bug report. To generate high-quality synthetic data, we propose novel data augmentation operators that act on different constituent components of bug reports. We also describe a data balancing strategy that aims to create a corpus of augmented bug reports that better reflects the entire source code base, because existing bug reports used as training data usually reference a small part of the code base. Data balancing helps the model perform better for newlyreported bug reports that reference previously unobserved code. Our evaluation results indicate that both data augmentation and balancing are effective, improving retrieval performance across all three BERT-based models we studied.

show abstract

Can vectors read minds better than experts? Comparing data augmentation strategies for the automated scoring of children’s mindreading ability

Cited by 8 publications

References 24 publications

NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

NL-Augmenter 🦎 → 🐍 A Framework for Task-Sensitive Natural Language Augmentation

The State of Human-centered NLP Technology for Fact-checking

Fast changeset-based bug localization with BERT

Contact Info

Product

Resources

About