Word embeddings are real-valued word representations, trained on natural language corpora, that are able to capture lexical semantics. Models producing these representations have gained popularity in recent years, but the question of the most adequate evaluation method remains open. This paper presents an extensive overview of the field of word embeddings evaluation, highlighting the main problems and proposing a typology of approaches to evaluation, summarizing 16 intrinsic methods and 12 extrinsic methods. I describe both widely-used and experimental methods, systematize information about evaluation datasets and discuss some key challenges.
3. Absence of correlation between intrinsic and extrinsic methods. Performance scores of word embeddings, when measured with the two existing classes of evaluation approaches (intrinsic and extrinsic), do not correlate with each other, so it is unclear which class of methods is more adequate.
4. Lack of significance tests. Statistical significance tests are sometimes not performed in the key experiments with new distributional models and evaluation methods. As a result, some of the evaluation results reported in these papers are less reliable than is desirable.
5. The hubness problem. It is unclear how to deal with so-called hubs: word vectors, typically representing very frequent words, that are close to a disproportionately large number of other word vectors. Cosine distances between any two word vectors are therefore likely to be distorted by the hubs, and any evaluation in this case is biased.
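To make the hubness problem concrete, a common way to quantify it is the k-occurrence statistic: count how often each word vector appears among the k nearest neighbours of the other vectors, and flag vectors with disproportionately high counts as hubs. The sketch below is a minimal illustration of this idea using NumPy and random vectors; it is not taken from any particular evaluation paper, and the function name and parameters are illustrative assumptions.

```python
import numpy as np

def hubness_counts(vectors, k=10):
    """Count how often each vector appears among the k nearest
    neighbours (by cosine similarity) of the other vectors.
    Vectors with disproportionately high counts are 'hubs'."""
    # normalise rows so that the dot product equals cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-similarity
    # indices of the k most similar vectors for every word
    knn = np.argsort(-sims, axis=1)[:, :k]
    # k-occurrence: how many neighbour lists each index appears in
    return np.bincount(knn.ravel(), minlength=len(vectors))

# toy example: 1000 random 50-dimensional "embeddings"
rng = np.random.default_rng(0)
emb = rng.normal(size=(1000, 50))
counts = hubness_counts(emb, k=10)
print("max k-occurrence:", counts.max(), " mean:", counts.mean())
```

The mean k-occurrence is always k by construction; a heavily skewed distribution (a few vectors with counts far above the mean) signals hubness and suggests that similarity-based evaluation scores may be biased by those vectors.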