Proceedings of the 15th Workshop on Biomedical Natural Language Processing 2016
DOI: 10.18653/v1/w16-2922
How to Train good Word Embeddings for Biomedical NLP

Abstract: The quality of word embeddings depends on the input corpora, model architectures, and hyper-parameter settings. Using the state-of-the-art neural embedding tool word2vec and both intrinsic and extrinsic evaluations, we present a comprehensive study of how the quality of embeddings changes according to these features. Apart from identifying the most influential hyper-parameters, we also observe one that creates contradictory results between intrinsic and extrinsic evaluations. Furthermore, we find that bigger c…

Cited by 289 publications (240 citation statements)
References 14 publications
“…However, we found that the majority of word similarity datasets fail to predict which representations will be successful in sequence labelling tasks, with only one intrinsic measure, SimLex-999, showing high correlation with extrinsic measures. In concurrent work, we have also observed a similar effect for biomedical domain tasks and word vectors (Chiu et al, 2016). We further considered the differentiation between relatedness (association) and similarity (synonymy) as an explanatory factor, noting that the majority of intrinsic evaluation datasets do not systematically make this distinction.…”
Section: Results (mentioning)
confidence: 66%
“…Before that, we count the frequency of occurrence of each word in the dataset and use these word frequencies to create a dictionary, then express each word by the frequency rank of the corresponding word (Kim, 2014). Next, we train word embeddings according to (Chiu et al, 2016), and at the same time download the embedding sets that have already been trained 2 .…”
Section: System Description (mentioning)
confidence: 99%
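The frequency-ordered dictionary described in the statement above can be sketched as follows. This is a minimal illustration of the technique, not code from the citing paper; the function name and tie-breaking rule are my own choices.

```python
from collections import Counter

def build_index(tokens):
    """Map each word to its rank in a frequency-ordered vocabulary.

    Rank 1 is the most frequent word, so every word is expressed by the
    frequency order of the corresponding word, as the citing paper describes.
    """
    counts = Counter(tokens)
    # Sort by descending frequency; ties broken alphabetically for determinism.
    vocab = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i + 1 for i, w in enumerate(vocab)}

tokens = "the cat sat on the mat the cat".split()
index = build_index(tokens)          # e.g. {"the": 1, "cat": 2, ...}
encoded = [index[w] for w in tokens]  # each word replaced by its rank
```

Once every word is an integer rank, the sequence can be fed to an embedding layer whose rows are looked up by these indices.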
“…Chiu et al. [57]. The embedding for unknown words was initialized from a uniform (−ε, ε) distribution, where ε was determined such that the unknown vectors have approximately the same variance as that of pre-trained data [58].…”
Section: Machine Learning Based Computational Models (mentioning)
confidence: 99%
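The variance-matching initialization described above follows from Var[U(−ε, ε)] = ε²/3, so choosing ε = √(3·Var[pretrained]) gives unknown-word vectors with approximately the pre-trained variance. The sketch below assumes this standard derivation; the function name and interface are illustrative, not taken from [57] or [58].

```python
import numpy as np

def init_unknown(pretrained, n_unknown, seed=0):
    """Draw unknown-word vectors from U(-eps, eps), with eps chosen so the
    new vectors' variance matches that of the pre-trained embeddings.

    Since Var[U(-eps, eps)] = eps**2 / 3, we set eps = sqrt(3 * var).
    """
    rng = np.random.default_rng(seed)
    eps = np.sqrt(3.0 * pretrained.var())
    dim = pretrained.shape[1]
    return rng.uniform(-eps, eps, size=(n_unknown, dim))

# Stand-in for a loaded embedding matrix (1000 words, 50 dimensions).
pretrained = np.random.default_rng(1).normal(scale=0.1, size=(1000, 50))
unk = init_unknown(pretrained, n_unknown=200)
```

With enough samples, `unk.var()` lands close to `pretrained.var()`, which is the point of matching variances: unknown words start on the same scale as known ones instead of dominating or vanishing in downstream layers.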