Random walks over large knowledge bases like WordNet have been successfully used in word similarity, relatedness and disambiguation tasks. Unfortunately, those algorithms are relatively slow for large repositories and have significant memory footprints. In this paper we present a novel algorithm which encodes the structure of a knowledge base in a continuous vector space, combining random walks and neural network language models in order to produce novel word representations. Evaluation on word relatedness and similarity datasets yields equal or better results than those of a random walk algorithm, using a dense representation (300 dimensions instead of 117K). Furthermore, the word representations are complementary to those of the random walk algorithm and to corpus-based continuous representations, improving the state of the art on the similarity dataset. Our technique opens up exciting opportunities to combine distributional and knowledge-based word representations.
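The core idea described above can be illustrated with a minimal sketch (not the authors' implementation): generate a pseudo-corpus of random walks over a knowledge-base graph and train a skip-gram model on it, so that graph structure ends up encoded in a dense vector space. The toy graph, the walk parameters, and the gensim 4.x API usage below are illustrative assumptions.

```python
import random
from gensim.models import Word2Vec

# Toy undirected graph: each node is a word, edges mimic KB relations such as
# hypernymy or synonymy in WordNet.
graph = {
    "dog": ["canine", "pet", "puppy"],
    "canine": ["dog", "wolf"],
    "pet": ["dog", "cat"],
    "puppy": ["dog"],
    "wolf": ["canine"],
    "cat": ["pet", "feline"],
    "feline": ["cat"],
}

def random_walks(graph, walks_per_node=50, walk_length=10, seed=0):
    """Generate pseudo-sentences by walking the KB graph."""
    rng = random.Random(seed)
    corpus = []
    for start in graph:
        for _ in range(walks_per_node):
            node, walk = start, [start]
            for _ in range(walk_length - 1):
                node = rng.choice(graph[node])
                walk.append(node)
            corpus.append(walk)
    return corpus

corpus = random_walks(graph)
# Skip-gram (sg=1) over the pseudo-corpus; 300 dimensions as in the abstract.
model = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1, epochs=5)
print(model.wv.similarity("dog", "cat"))
```

Words that are close in the knowledge base co-occur frequently in the walks, so the resulting dense vectors place them near each other, which is what allows the 300-dimensional representation to stand in for the full 117K-dimensional one.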
Bilingual word embeddings represent words of two languages in the same space, and allow knowledge to be transferred from one language to the other without machine translation. The main approach is to train monolingual embeddings first and then map them using bilingual dictionaries. In this work, we present a novel method to learn bilingual embeddings based on multilingual knowledge bases (KB) such as WordNet. Our method extracts bilingual information from multilingual wordnets via random walks and learns a joint embedding space in one go. We further reinforce cross-lingual equivalence by adding bilingual constraints to the loss function of the popular skipgram model. Our experiments involve twelve cross-lingual word similarity and relatedness datasets in six language pairs covering four languages, and show that: 1) random walks over multilingual wordnets improve results over just using dictionaries; 2) multilingual wordnets on their own improve over text-based systems in similarity datasets; 3) the good results are consistent for large wordnets (e.g. English, Spanish), smaller wordnets (e.g. Basque) and loosely aligned wordnets (e.g. Italian); 4) the combination of wordnets and text yields the best results, above mapping-based approaches. Our method can be applied to richer KBs like DBpedia or BabelNet, and can be easily extended to multilingual embeddings. All software and resources are open source.
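A hedged sketch of the bilingual idea follows: walk over language-independent synsets and, at each step, emit a lemma in a randomly chosen language, so that words of both languages appear in the same pseudo-corpus and are therefore embedded in the same space. The toy synset graph, lemma lists, and sampling scheme are illustrative assumptions, not the exact procedure of the paper, and the additional bilingual constraints on the skipgram loss are not shown.

```python
import random
from gensim.models import Word2Vec

# Toy multilingual "wordnet": synset id -> neighbouring synsets and lemmas per language.
synsets = {
    "s_dog":    {"nbrs": ["s_canine", "s_pet"], "lemmas": {"en": ["dog"], "es": ["perro"]}},
    "s_canine": {"nbrs": ["s_dog"],             "lemmas": {"en": ["canine"], "es": ["canido"]}},
    "s_pet":    {"nbrs": ["s_dog", "s_cat"],    "lemmas": {"en": ["pet"], "es": ["mascota"]}},
    "s_cat":    {"nbrs": ["s_pet"],             "lemmas": {"en": ["cat"], "es": ["gato"]}},
}

def bilingual_walks(synsets, langs=("en", "es"), walks_per_node=100, walk_length=8, seed=0):
    """Random walks over synsets, emitting a lemma in a random language at each step."""
    rng = random.Random(seed)
    corpus = []
    for start in synsets:
        for _ in range(walks_per_node):
            node, sentence = start, []
            for _ in range(walk_length):
                lang = rng.choice(langs)  # pick a language at each step
                sentence.append(rng.choice(synsets[node]["lemmas"][lang]))
                node = rng.choice(synsets[node]["nbrs"])
            corpus.append(sentence)
    return corpus

model = Word2Vec(bilingual_walks(synsets), vector_size=300, sg=1, min_count=1, window=5)
print(model.wv.similarity("dog", "perro"))  # translations should land close together
```

Because translations of the same concept share a synset, they co-occur in the walks and end up with similar vectors, which is what makes a joint space possible without a post-hoc mapping step.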
Text and Knowledge Bases are complementary sources of information. Given the success of distributed word representations learned from text, several techniques to infuse additional information from sources like WordNet into word representations have been proposed. In this paper, we follow an alternative route. We learn word representations from text and WordNet independently, and then explore simple and sophisticated methods to combine them. The combined representations are applied to an extensive set of datasets on word similarity and relatedness. Simple combination methods turn out to perform better than more complex methods like CCA or retrofitting, showing that, in the case of WordNet, learning word representations separately is preferable to learning one single representation space or adding WordNet information directly. A key factor, which we illustrate with examples, is that the WordNet-based representations capture similarity relations encoded in WordNet better than retrofitting. In addition, we show that the average of the similarities from six word representations yields results beyond the state of the art on several datasets, reinforcing the opportunities to explore further combination techniques.
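The simplest combination mentioned above can be sketched as follows: compute the cosine similarity of a word pair in each representation space independently and average the scores, rather than merging the spaces themselves. The toy random vectors below stand in for independently trained text-based and WordNet-based embeddings and are purely illustrative.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def combined_similarity(word1, word2, spaces):
    """Average pairwise similarity over several embedding spaces (dicts: word -> vector)."""
    sims = [cosine(space[word1], space[word2])
            for space in spaces if word1 in space and word2 in space]
    return sum(sims) / len(sims) if sims else None

rng = np.random.default_rng(0)
# Placeholders for embeddings learned separately from text and from WordNet.
text_space = {"car": rng.normal(size=300), "automobile": rng.normal(size=300)}
wordnet_space = {"car": rng.normal(size=300), "automobile": rng.normal(size=300)}
print(combined_similarity("car", "automobile", [text_space, wordnet_space]))
```

Averaging similarities keeps each space intact, so information encoded only in one source (for example, WordNet synonymy) is not diluted by projecting everything into a single space.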