To overcome data sparseness in word embeddings trained on low-resource languages, we propose a word embedding model based on punctuation and a parallel corpus. In particular, we generate a global word-pair co-occurrence matrix with a punctuation-based distance attenuation function, and integrate it with intermediate word vectors generated from a small-scale bilingual parallel corpus to train word embeddings. Experimental results show that, compared with widely used baseline models such as GloVe and Word2vec, our model significantly improves the quality of word embeddings for low-resource languages. Trained on a restricted-scale English-Chinese corpus, our model improves the word analogy task by 0.71 percentage points and achieves the best results in all of the word similarity tasks.

GloVe was presented as a model based on global matrix decomposition. Meanwhile, a more widely used family of word embedding models is derived from neural network models, first proposed by Bengio et al. [9] in 2003. Because the neural network language model (NNLM) is inefficient to train, Mikolov et al. [10] proposed Word2vec, an efficient open-source word embedding tool, by simplifying the N-gram neural network model.

Both Word2vec and GloVe satisfy the basic needs of simple natural language processing tasks, such as word analogy and word similarity, but perform poorly in tasks oriented toward special conditions and fields. There are two ways to improve the performance of word embeddings. One is to extract and combine more features from the context, such as morphological features [11], dependency structures [12], knowledge bases [13], and semantic relations [14]. The other is to incorporate language models trained with neural networks on large-scale corpora, such as ELMo [15], GPT [16], BERT [17], and XLM [18]. Both approaches significantly improve the semantic expressiveness of word embeddings, yet they require substantial extra resources, including but not limited to corpora, encyclopedic dictionaries, semantic networks, morphology and dependency syntax analysis tools, and GPU servers. Unfortunately, none of these resources is easily available for low-resource languages, which limits the improvement of their word embeddings.

In this paper, we optimize the word embedding model for low-resource languages based on intra-sentence punctuation and an easy-to-obtain bilingual parallel corpus. We first generate the global word-pair co-occurrence matrix, and accordingly reconstruct GloVe, using a punctuation-based distance attenuation function built on the features of punctuation and relative distance. Then, we obtain the intermediate vectors of the target language from the word alignment probabilities and the intermediate vectors of the parallel language, trained with GIZA++ and the reconstructed GloVe, respectively, on the bilingual parallel corpus. Finally, we construct the low-resource word embedding model from the global word-pair co-occurrence matrix and the intermediate vectors.
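This excerpt does not spell out the attenuation function itself, so the following Python sketch only illustrates the idea under stated assumptions: as in standard GloVe, a co-occurring pair contributes 1/d for words d tokens apart, and the contribution is additionally multiplied by a hypothetical factor `alpha` for every intra-sentence punctuation mark lying between the two words. The punctuation set and `alpha` are placeholders, not the paper's actual choices.

```python
from collections import defaultdict

# Illustrative set of intra-sentence punctuation (English and Chinese marks);
# the set actually used in the paper may differ.
PUNCT = {",", ";", ":", "\u3001", "\uFF0C", "\uFF1B", "\uFF1A"}

def cooccurrence_matrix(sentences, window=10, alpha=0.5):
    """Build a global word-pair co-occurrence matrix whose entries decay with
    relative distance (1/d, as in standard GloVe) and are further attenuated
    by alpha for every punctuation mark between the two words.

    `alpha` is a hypothetical per-punctuation attenuation factor.
    """
    counts = defaultdict(float)
    for tokens in sentences:
        for i, w in enumerate(tokens):
            if w in PUNCT:
                continue  # punctuation is not a vocabulary word
            punct_between = 0
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                v = tokens[j]
                if v in PUNCT:
                    punct_between += 1  # each mark strengthens the attenuation
                    continue
                dist = j - i
                weight = (alpha ** punct_between) / dist
                counts[(w, v)] += weight
                counts[(v, w)] += weight  # keep the matrix symmetric
    return counts
```

For example, in the sentence `['he', 'left', ',', 'she', 'stayed']`, the pair `('left', 'she')` is two tokens apart with one comma between the words, so it contributes `alpha * 1/2` rather than GloVe's plain `1/2`, reflecting the weaker semantic tie across the clause boundary.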
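The excerpt also leaves open exactly how the GIZA++ word alignment probabilities and the parallel-language intermediate vectors are combined. A plausible, minimal sketch is a probability-weighted average: each target-language word receives the expectation of the source-side vectors under p(source | target). The dictionary shapes below are assumptions made for illustration; GIZA++'s actual lexical translation tables would have to be parsed into this form first.

```python
import numpy as np

def target_intermediate_vectors(align_probs, source_vecs, dim):
    """Project parallel-language intermediate vectors onto target-language
    words through word alignment probabilities.

    align_probs: {target_word: {source_word: p(source | target)}}, assumed to
        be read from a GIZA++ lexical translation table.
    source_vecs: {source_word: np.ndarray of shape (dim,)}, trained with the
        reconstructed GloVe on the parallel side of the bilingual corpus.
    """
    target_vecs = {}
    for t_word, dist in align_probs.items():
        vec = np.zeros(dim)
        covered = 0.0
        for s_word, p in dist.items():
            if s_word in source_vecs:  # skip alignments without a vector
                vec += p * source_vecs[s_word]
                covered += p
        if covered > 0:
            # Renormalize over the probability mass actually covered, so
            # missing source vectors do not shrink the result toward zero.
            target_vecs[t_word] = vec / covered
    return target_vecs
```

These target-side intermediate vectors would then serve, together with the attenuated co-occurrence matrix, as inputs to the final low-resource embedding model described above.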