Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics 2019
DOI: 10.18653/v1/p19-1268
Aiming beyond the Obvious: Identifying Non-Obvious Cases in Semantic Similarity Datasets

Abstract: Existing datasets for scoring text pairs in terms of semantic similarity contain instances whose resolution differs according to the degree of difficulty. This paper proposes to distinguish obvious from non-obvious text pairs based on superficial lexical overlap and ground-truth labels. We characterise existing datasets in terms of containing difficult cases and find that recently proposed models struggle to capture the non-obvious cases of semantic similarity. We describe metrics that emphasise cases of simil…
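The obvious/non-obvious distinction described in the abstract can be illustrated with a short sketch. This is a minimal illustration, not the paper's exact procedure: it assumes a pair counts as non-obvious when its surface lexical overlap disagrees with the gold label (high overlap but labelled dissimilar, or low overlap but labelled similar); the Jaccard overlap measure and the 0.5 threshold are placeholder assumptions.

```python
# Hypothetical sketch of the obvious vs. non-obvious split.
# The overlap measure and the 0.5 threshold are illustrative assumptions,
# not the exact procedure of Peinelt et al. (2019).

def jaccard_overlap(text_a: str, text_b: str) -> float:
    """Token-level Jaccard similarity between two texts."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)

def is_non_obvious(text_a: str, text_b: str, gold_label: int,
                   threshold: float = 0.5) -> bool:
    """Flag a pair as non-obvious when surface overlap disagrees with the
    gold label: high overlap but labelled dissimilar (0), or low overlap
    but labelled similar (1)."""
    high_overlap = jaccard_overlap(text_a, text_b) >= threshold
    return high_overlap != bool(gold_label)
```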

Cited by 11 publications (10 citation statements); references 14 publications.
“…As we show, randomly combining sentences is insufficient. Sampling appropriate pairs has a decisive impact on performance, which corresponds to recent findings on similar datasets (Peinelt et al., 2019).…”
Section: Related Work (supporting)
confidence: 78%
“…SemEval C) than accuracy. We further report performance on difficult cases with the non-obvious F1 score (Peinelt et al., 2019), which identifies challenging instances in the dataset based on lexical overlap and gold labels. Dodge et al. (2020) recently showed that early stopping and random seeds can have considerable impact on the performance of finetuned BERT models.…”
Section: tBERT 3.1 Architecture (mentioning)
confidence: 99%
“…Our main evaluation metric is the F1 score, as this is more meaningful than accuracy for datasets with imbalanced label distributions (such as SemEval C, see Appendix A). We also report performance on difficult cases using the non-obvious F1 score (Peinelt et al., 2019). This metric distinguishes obvious from non-obvious instances in a dataset based on lexical overlap and gold labels, and calculates a separate F1 score for challenging cases.…”
Section: Metrics (mentioning)
confidence: 99%
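As the citation statements above describe it, the non-obvious F1 amounts to a standard F1 computed only over the instances flagged as challenging. A minimal sketch, assuming a boolean mask produced by a split such as the one sketched earlier and using scikit-learn's f1_score; the function name and mask argument are illustrative, not the authors' implementation.

```python
# Sketch: F1 restricted to the non-obvious subset of a dataset.
from sklearn.metrics import f1_score

def non_obvious_f1(gold_labels, predictions, non_obvious_mask):
    """Compute F1 only over instances flagged as non-obvious (challenging)."""
    gold = [g for g, keep in zip(gold_labels, non_obvious_mask) if keep]
    pred = [p for p, keep in zip(predictions, non_obvious_mask) if keep]
    return f1_score(gold, pred)

# Usage example with toy labels: only the masked positions contribute.
print(non_obvious_f1([1, 0, 1, 0], [1, 1, 0, 0], [True, True, False, True]))
```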