Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we systematically analyze the statistical distribution of similarity values in two series of experiments. The first examines how the distribution of similarity values depends on the embedding-model algorithm and its parameters. The second starts by showing that intuitive similarity thresholds do not exist. We then propose a method for determining which similarity values are actually meaningful for a given embedding model. In more abstract terms, our insights should pave the way for a better understanding of the notion of similarity in embedding models and for more reliable evaluations of such models.
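To make the idea of a "distribution of similarity values" concrete, the following minimal sketch computes the cosine similarity of every pair of vectors in a small set and summarizes the result. It is an illustration only and not the method proposed in this paper: random Gaussian vectors stand in for trained word embeddings, and the function names (`cosine`, `pairwise_similarities`, `summarize`) as well as the choice of the 99th percentile as a summary statistic are assumptions made for this example.

```python
import itertools
import math
import random
import statistics


def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_similarities(vectors):
    """Cosine similarity of every unordered pair of distinct vectors."""
    return [cosine(u, v) for u, v in itertools.combinations(vectors, 2)]


def summarize(similarities):
    """Mean, population standard deviation, and nearest-rank 99th percentile."""
    ordered = sorted(similarities)
    rank = math.ceil(0.99 * len(ordered))  # 1-based nearest-rank position
    return {
        "mean": statistics.fmean(ordered),
        "stdev": statistics.pstdev(ordered),
        "p99": ordered[rank - 1],
    }


# Placeholder data: 100 random 50-dimensional vectors (assumption, not real
# embeddings). With a trained model, these would be the word vectors.
random.seed(0)
toy_vectors = [[random.gauss(0.0, 1.0) for _ in range(50)] for _ in range(100)]

sims = pairwise_similarities(toy_vectors)
summary = summarize(sims)
print(summary)
```

With real embeddings, the same summary could be computed per model and per parameter setting to compare how the distributions differ; which statistic, if any, marks a similarity value as meaningful is exactly the question the paper addresses and is not settled by this sketch.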