Generating Label Cohesive and Well-Formed Adversarial Claims

Atanasova, Pepa; Wright, Dustin; Augenstein, Isabelle

doi:10.18653/v1/2020.emnlp-main.256

Cited by 32 publications

(34 citation statements)

References 26 publications

(20 reference statements)


“…The inter-annotator agreement (computed with Cohen's kappa [7]) between the annotators is 0.47, which signals "moderate" agreement [24]. This is comparable to the inter-annotator agreement in Atanasova et al [1], where claims generated with GPT-2 were annotated for semantic coherence. Table 10 shows the Cohen's kappa for each dataset separately.…”

Section: Example #4 | supporting

confidence: 51%

“…While much recent work in adversarial attacks aims to break NLI systems and is especially adapted to this problem [13,29], these stress tests have been applied to several other tasks, e.g. Question-Answering [49], Machine Translation [4], or Fact Checking [1,44]. Unfortunately, preserving the semantics of a sentence while automatically generating these adversarial attacks is difficult, which is why some works have defined small stress tests manually [19,27].…”

Section: Dataset | mentioning

confidence: 99%

“…Due to the high subjectivity of this task, the annotation was conducted by two human annotators; the first author and a postdoctoral researcher with background in natural language processing (not involved in this work). The inter-annotator agreement was computed with Cohen's Kappa [7] and signals "moderate" agreement [24] with κ = 0.47 (see Appendix 2 for more information about the annotation process), which is comparable to the inter-annotator agreement in Atanasova et al [1], where claims generated with GPT-2 were annotated for semantic coherence. The percentage of samples annotated as "semantically equivalent" is 48.4% (average of both annotators), resulting in a correctness ratio c_a of 0.484 for the paraphrase attack.…”

Section: Metrics (Original) | mentioning

confidence: 99%
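
The agreement and correctness figures quoted in the statement above can be illustrated with a minimal sketch. It assumes binary labels (1 = "semantically equivalent", 0 = not) from two annotators; the function names and the toy label lists are hypothetical and are not taken from the cited paper, whose actual annotation data is not reproduced here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


def correctness_ratio(labels_a, labels_b, positive=1):
    """Share of items marked `positive`, averaged over both annotators."""
    share_a = sum(label == positive for label in labels_a) / len(labels_a)
    share_b = sum(label == positive for label in labels_b) / len(labels_b)
    return (share_a + share_b) / 2


# Hypothetical toy annotations, for illustration only.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 0]
annotator_2 = [1, 0, 0, 0, 1, 1, 1, 0]

kappa = cohens_kappa(annotator_1, annotator_2)
ratio = correctness_ratio(annotator_1, annotator_2)
```

On the toy lists the annotators agree on 6 of 8 items while chance agreement is 0.5, so kappa comes out at 0.5; the quoted 0.47 and 0.484 would result from the same arithmetic applied to the real annotations.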

“…While these differences are problematic for cross-domain performance, it can also be seen as an advantage, as it concludes in an abundance of datasets from different domains that can be integrated into transfer or multi-task learning approaches. Yet, given the decent human performance on this task, it is hard to grasp why ML models fall short of StD, while they are almost on par for related tasks like Sentiment Analysis and Natural Language Inference (NLI).…”

Section: Introduction | mentioning

confidence: 99%

“…The contributions of this paper are as follows: (1) to the best of our knowledge, within the field of StD we are the first to combine learning from related tasks (via TL) and MDL, designed to capture all facets of StD tasks, and achieve new state-of-the-art results on five of ten datasets. (2) In an in-depth analysis with adversarial attacks, we show that TL and MDL for StD generally improves the performance of ML models, but also drastically reduces their robustness if compared to SDL models.…”

Section: Introduction | mentioning

confidence: 99%


Stance Detection Benchmark: How Robust is Your Stance Detection?

2021


Stance detection (StD) aims to detect an author’s stance towards a certain topic and has become a key component in applications like fake news detection, claim validation, or argument search. However, while stance is easily detected by humans, machine learning (ML) models are clearly falling short of this task. Given the major differences in dataset sizes and framing of StD (e.g. number of classes and inputs), ML models trained on a single dataset usually generalize poorly to other domains. Hence, we introduce a StD benchmark that allows to compare ML models against a wide variety of heterogeneous StD datasets to evaluate them for generalizability and robustness. Moreover, the framework is designed for easy integration of new datasets and probing methods for robustness. Amongst several baseline models, we define a model that learns from all ten StD datasets of various domains in a multi-dataset learning (MDL) setting and present new state-of-the-art results on five of the datasets. Yet, the models still perform well below human capabilities and even simple perturbations of the original test samples (adversarial attacks) severely hurt the performance of MDL models. Deeper investigation suggests overfitting on dataset biases as the main reason for the decreased robustness. Our analysis emphasizes the need of focus on robustness and de-biasing strategies in multi-task learning approaches. To foster research on this important topic, we release the dataset splits, code, and fine-tuned weights.

