2017 ACM/IEEE Joint Conference on Digital Libraries (JCDL)
DOI: 10.1109/jcdl.2017.7991568
On the Various Semantics of Similarity in Word Embedding Models

Abstract: Finding similar words with the help of word embedding models has yielded meaningful results in many cases. However, the notion of similarity has remained ambiguous. In this paper, we examine when exactly similarity values in word embedding models are meaningful. To do so, we analyze the statistical distribution of similarity values systematically, in two series of experiments. The first one examines how the distribution of similarity values depends on the different embedding-model algorithms and parameters. The…
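The abstract describes analyzing the statistical distribution of similarity values in an embedding model. As a minimal, hedged sketch of one way such a distribution could be estimated (not the paper's actual experimental setup), the following samples random word pairs from a pre-trained model and summarizes their cosine similarities; the file name, gensim loading, and sampling scheme are illustrative assumptions.

```python
# Sketch: estimate the distribution of cosine-similarity values in a word
# embedding model by sampling random word pairs. Not the paper's exact setup.
import random
import numpy as np
from gensim.models import KeyedVectors

# Assumption: a pre-trained embedding in word2vec text format is available locally.
kv = KeyedVectors.load_word2vec_format("embeddings.txt", binary=False)

vocab = list(kv.key_to_index)  # gensim >= 4.0 vocabulary access
pairs = [(random.choice(vocab), random.choice(vocab)) for _ in range(10_000)]
sims = np.array([kv.similarity(w1, w2) for w1, w2 in pairs])

# Summary statistics of the similarity distribution.
print(f"mean={sims.mean():.3f}  std={sims.std():.3f}")
print("deciles:", np.percentile(sims, np.arange(0, 101, 10)).round(3))
```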

Cited by 23 publications (9 citation statements)
References 30 publications
“…In this paper, we opt for the former, therefore we set the values as follows: embedding vector = 300, min_word_count = 10, window_size = 5, number_workers = 5, dict_size = 100,000. These values have been established to be a reasonable baseline setting for various semantic similarity tasks (Elekes et al., 2017, 2018; Hill et al., 2015). For the cosine similarity transformation function parameters (a and c) used in our second algorithm, we set them using cross-validation as a = {20, 30, …, 60} and c = {0.9, 0.85, …, 0.7}.…”
Section: Methods
Mentioning confidence: 99%
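The hyperparameters quoted above correspond to a fairly standard word2vec-style configuration. As a hedged sketch (not the citing paper's code), they could be instantiated with gensim roughly as follows; the corpus path, the example query word, and the use of max_final_vocab to approximate dict_size are assumptions, and the cosine-transformation function with parameters a and c is not specified here, so it is not reproduced.

```python
# Hedged sketch of the baseline configuration quoted above, using gensim's Word2Vec.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("corpus.txt")  # one pre-tokenized sentence per line (assumed)

model = Word2Vec(
    sentences,
    vector_size=300,         # "embedding vector = 300"
    min_count=10,            # "min_word_count = 10"
    window=5,                # "window_size = 5"
    workers=5,               # "number_workers = 5"
    max_final_vocab=100_000, # closest gensim analogue to "dict_size = 100,000" (assumption)
)

# Example query word, assumed to be in the vocabulary.
print(model.wv.most_similar("library", topn=5))
```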
“…Note that the results can be transferred to the skip-gram learning algorithm of Word2Vec as well as to GloVe (Pennington et al., 2014; Levy and Goldberg, 2014). This is because very recent work has shown that the distribution of the cosine distances among the word vectors of all such models (after normalization) is highly similar (Elekes et al., 2017).…”
Section: Realizations of Word Embedding Models
Mentioning confidence: 99%
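The claim quoted above concerns the similarity of cosine-distance distributions across models after normalization. A small sketch of how two such distributions could be compared is given below; the model file names and the use of a Kolmogorov-Smirnov statistic are illustrative assumptions, not the cited methodology.

```python
# Sketch: compare the cosine-similarity distributions of two embedding models
# (e.g. skip-gram vs. GloVe) after length-normalizing the vectors.
import numpy as np
from gensim.models import KeyedVectors
from scipy.stats import ks_2samp

def similarity_sample(path, n_pairs=50_000, seed=0):
    kv = KeyedVectors.load_word2vec_format(path, binary=False)
    vecs = kv.get_normed_vectors()                  # unit-length vectors
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(vecs), n_pairs)
    j = rng.integers(0, len(vecs), n_pairs)
    return np.einsum("ij,ij->i", vecs[i], vecs[j])  # cosine = dot product of unit vectors

a = similarity_sample("skipgram.txt")   # assumed model files
b = similarity_sample("glove.txt")
print(ks_2samp(a, b))  # a small statistic indicates the distributions are close
```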
“…There is related work that studies the impact of these parameters (Baroni et al., 2014; Hill et al., 2014; Elekes et al., 2017). Consequently, for all parameters mentioned except for the window size, we can rely on the results from the literature.…”
Section: Building a Word Embedding Model
Mentioning confidence: 99%
“…The training process additionally requires setting the values for the hyperparameters in such algorithms, particularly the vector size and context window size. In the literature on word-embedding models (Elekes et al., 2017; Pennington et al., 2014) and applications (Banerjee et al., 2018; Kuzi et al., 2016; S. Li et al., 2018; Risch & Krestel, 2019), researchers normally experimented with different values for such parameters and determined the values according to specific contexts and needs.…”
Section: Term Vectorization
Mentioning confidence: 99%
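The statement above describes tuning vector size and context window size per task. The sketch below shows one common form such an experiment can take, sweeping a small grid and scoring each model on a word-similarity benchmark; the corpus path, grid values, and the WordSim-353 file bundled with gensim's test data are assumptions for illustration only.

```python
# Sketch of a small hyperparameter sweep over vector size and window size,
# scored on a word-similarity benchmark. Illustrative, not any cited paper's code.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence
from gensim.test.utils import datapath

sentences = LineSentence("corpus.txt")  # assumed pre-tokenized corpus, one sentence per line

results = {}
for vector_size in (100, 200, 300):
    for window in (2, 5, 10):
        model = Word2Vec(sentences, vector_size=vector_size, window=window,
                         min_count=10, workers=4, epochs=5)
        # evaluate_word_pairs returns (pearson, spearman, oov_ratio)
        pearson, spearman, oov = model.wv.evaluate_word_pairs(
            datapath("wordsim353.tsv"))
        results[(vector_size, window)] = spearman.correlation

best = max(results, key=results.get)
print("best (vector_size, window):", best, "spearman:", round(results[best], 3))
```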