Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP 2016
DOI: 10.18653/v1/w16-2502

A critique of word similarity as a method for evaluating distributional semantic models

Abstract: This paper aims to re-think the role of the word similarity task in distributional semantics research. We argue that while it is a valuable tool, it should be used with care because it provides only an approximate measure of the quality of a distributional model. Word similarity evaluations assume there exists a single notion of similarity that is independent of any particular application. Further, the small size and low inter-annotator agreement of existing data sets make it challenging to find significant differences…
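For concreteness, below is a minimal sketch of the evaluation protocol the paper critiques: rank word pairs by model cosine similarity and correlate that ranking with human ratings via Spearman's ρ. The embedding table and rated pairs are hypothetical toy stand-ins for a real model and a real data set such as SimLex-999.

```python
# Sketch of the standard word-similarity evaluation (hypothetical data).
import numpy as np
from scipy.stats import spearmanr

embeddings = {
    "cup":   np.array([0.9, 0.1, 0.3]),
    "mug":   np.array([0.8, 0.2, 0.4]),
    "coast": np.array([0.1, 0.9, 0.2]),
    "shore": np.array([0.2, 0.8, 0.1]),
}

# (word1, word2, human similarity rating on a 0-10 scale)
rated_pairs = [("cup", "mug", 8.5), ("coast", "shore", 9.1), ("cup", "coast", 1.2)]

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

model_scores = [cosine(embeddings[a], embeddings[b]) for a, b, _ in rated_pairs]
human_scores = [r for _, _, r in rated_pairs]

# A single rank correlation is the usual headline number for a model.
rho, p = spearmanr(model_scores, human_scores)
print(f"Spearman rho = {rho:.3f} (p = {p:.3f})")
```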

Cited by 44 publications (44 citation statements) · References 17 publications

Citation statements (ordered by relevance):
“…1 again). Finally, unlike with SimLex-999 or MEN scores, where it is difficult to interpret "what a similarity/relatedness of 7.69 exactly means" (Batchkarov et al., 2016; Avraham and Goldberg, 2016), the USF FSG scores have a direct, meaningful interpretation (i.e., FSG = #P/#G). To fully capture all aspects of the ground-truth USF data set, an evaluation protocol should ideally be based not only on response rankings, but also on the actual scores, i.e., the association strength.…”
Section: Introduction (mentioning)
confidence: 99%
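The point made above is that USF forward strength has a direct frequency reading, FSG = #P/#G (the share of participants in a group who produced a given response), so a model can be scored against the raw strengths, not only against their ranking. A hypothetical sketch, with invented counts and invented model scores standing in for real USF data:

```python
# Evaluating against raw FSG scores rather than rankings (toy numbers).
from scipy.stats import pearsonr

group_size = 150  # hypothetical number of participants, #G
responses_produced = {("cat", "dog"): 120, ("cat", "mouse"): 20, ("cat", "tree"): 2}

# FSG = #P / #G: fraction of participants producing each response
fsg = {pair: n / group_size for pair, n in responses_produced.items()}

# hypothetical model association scores for the same cue-response pairs
model_assoc = {("cat", "dog"): 0.71, ("cat", "mouse"): 0.22, ("cat", "tree"): 0.05}

pairs = sorted(fsg)
r, _ = pearsonr([fsg[p] for p in pairs], [model_assoc[p] for p in pairs])
print(f"Pearson r against raw FSG scores = {r:.3f}")
```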
“…Besides the identification of each LKB, the relation set and the algorithm used are given for each result, together with the mean Spearman correlation (ρ) and its standard deviation (σ). The latter were computed as suggested by Batchkarov et al. [42], who criticise how word similarity tests are used for assessing similarity models: even the largest test, RareWords, is too small to draw conclusions about the performance of a broad-coverage resource that aims to cover the whole language.…”
Section: Results for LKBs (mentioning)
confidence: 99%
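A minimal sketch of the kind of procedure attributed to Batchkarov et al. above: bootstrap-resample the word pairs of a similarity test set and report the mean and standard deviation of Spearman's ρ across resamples, rather than a single point estimate. The score arrays here are hypothetical placeholders for real model and human scores.

```python
# Bootstrapped mean and sigma of Spearman's rho over a test set (toy data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

model_scores = np.array([0.91, 0.85, 0.12, 0.40, 0.33, 0.77])  # hypothetical
human_scores = np.array([8.5, 9.1, 1.2, 4.0, 2.8, 7.3])        # hypothetical

n = len(model_scores)
rhos = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample word pairs with replacement
    rho, _ = spearmanr(model_scores[idx], human_scores[idx])
    rhos.append(rho)

# nan-safe aggregation, in case a degenerate resample yields a constant array
print(f"mean rho = {np.nanmean(rhos):.3f}, sigma = {np.nanstd(rhos):.3f}")
```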
“…Recently, word similarity has been criticised as an unreliable technique for evaluating distributional semantic models (Batchkarov et al., 2016), given the small size of the data sets and their limited context information. However, since this procedure is still widely accepted, we have performed two different kinds of experiments: rating by similarity and synonym detection with multiple-choice questions.…”
Section: Word Similarity (mentioning)
confidence: 99%
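The second experiment type mentioned above, synonym detection with multiple-choice questions, is typically TOEFL-style: the model must pick, from a set of candidates, the word whose vector is closest to the target's. A hypothetical sketch with toy embeddings and a single invented question:

```python
# TOEFL-style multiple-choice synonym detection (toy data).
import numpy as np

embeddings = {
    "enormous": np.array([0.9, 0.1]),
    "huge":     np.array([0.85, 0.15]),
    "tiny":     np.array([0.1, 0.9]),
    "purple":   np.array([0.4, 0.6]),
}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def answer(target, candidates):
    # choose the candidate most similar to the target word
    return max(candidates, key=lambda c: cosine(embeddings[target], embeddings[c]))

questions = [("enormous", ["huge", "tiny", "purple"], "huge")]
accuracy = sum(answer(t, cands) == gold for t, cands, gold in questions) / len(questions)
print(f"multiple-choice accuracy = {accuracy:.2f}")
```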