Hate speech detection relies largely on text data. These data, usually sourced from social media platforms, are known to suffer from numerous issues that reduce their quality and, in turn, the quality of the trained models. Among these issues are a lack of diversity and a diminutive class of interest in the dataset, which result in overfitted models that generalize poorly to other or newly collected data. Such issues can be handled by augmenting the data with diverse samples, engineering non-redundant features, or designing robust classification models. In this study, the focus is on data augmentation: a popular method for improving the quality of existing datasets by generating synthetic samples that mimic the distribution of the original samples. Extensive studies of how hate speech texts respond to different textual data augmentation techniques and methods are lacking. Specifically, we provide further insight into the token replacement method of textual data augmentation through empirical studies that investigate which embedding method(s) provide a robust source of synonyms for the replacement process, which method(s) effectively select the words to be replaced, and how to confirm that the label within each class is preserved. Our proposed methods, validated on two commonly used hate speech datasets affected by the known lack-of-diversity and diminutive-class-of-interest issues, significantly improve classification performance and provide insights into token replacement methods.
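To make the token replacement method concrete, the following is a minimal sketch of embedding-based synonym substitution. It is an illustration only, not the paper's method: the tiny hand-written embedding table, the similarity threshold, and the `replace_prob` parameter are all assumptions standing in for a real pretrained embedding model and the selection strategies studied empirically in the paper.

```python
import random

# Toy embedding table standing in for a real pretrained model
# (e.g. word2vec or GloVe); these vectors are illustrative
# assumptions, not trained values.
EMBEDDINGS = {
    "awful":    [0.90, 0.10, 0.00],
    "terrible": [0.85, 0.15, 0.05],
    "horrible": [0.88, 0.12, 0.02],
    "nice":     [0.00, 0.90, 0.10],
    "pleasant": [0.05, 0.85, 0.20],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def nearest_synonym(word, min_sim=0.9):
    """Return the most similar other word in the embedding space,
    or None if nothing exceeds the similarity threshold."""
    if word not in EMBEDDINGS:
        return None
    best, best_sim = None, min_sim
    for cand, vec in EMBEDDINGS.items():
        if cand == word:
            continue
        sim = cosine(EMBEDDINGS[word], vec)
        if sim > best_sim:
            best, best_sim = cand, sim
    return best

def augment(sentence, replace_prob=0.5, seed=0):
    """Token replacement: swap randomly chosen tokens for their
    nearest embedding-space neighbour to create a synthetic sample."""
    rng = random.Random(seed)
    out = []
    for tok in sentence.split():
        syn = nearest_synonym(tok.lower())
        if syn is not None and rng.random() < replace_prob:
            out.append(syn)
        else:
            out.append(tok)
    return " ".join(out)

print(augment("that was an awful comment", replace_prob=1.0))
```

In a real pipeline the embedding table would come from a pretrained model, the selection of which tokens to replace is itself a design choice (one of the questions the study investigates), and label preservation of the generated sample would still need to be checked.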