2019
DOI: 10.1016/j.jbi.2019.103096

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings

Abstract: Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is …

Cited by 22 publications (14 citation statements)
References 34 publications
“…With the growing amount of biomedical information available in textual form, there have been significant advances in the development of pretrained language representations that can be applied to a range of different tasks in the biomedical domain, such as pre-trained word embeddings, sentence embeddings, and contextual representations (Chiu et al., 2016; Peters et al., 2017; Smalheiser et al., 2019).…”
Section: Introduction
confidence: 99%
“…Next, for each article in PubMed, we assign a feature set that includes metadata features extracted from the PubMed XML record (or computed from information contained in the record) that we know (or suspect) may provide information that will help in assigning PTs. The feature set includes a variety of textual features – for example, words that appear in the title and/or in the abstract, as well as low-dimensional vector representations of these words (e.g., implicit term metrics 16 or word2vec neural embeddings 16,17 ). The feature set also includes journal name (since publication types are not distributed equally across journals), Medical Subject Headings, and other features such as the number of authors listed on the article (note that reviews are often single-authored whereas clinical trials generally have many author names on each paper).…”
Section: Results
confidence: 99%
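The feature set described in the statement above (title and abstract words, journal name, MeSH headings, author count) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: each PubMed record is assumed to be already parsed into a plain dict, and the field names (title, abstract, journal, mesh_terms, authors) and the build_feature_set helper are hypothetical, not the authors' actual pipeline.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the cited authors' actual PubMed processing code.

def build_feature_set(record):
    """Assemble a simple feature dict for one already-parsed PubMed record."""
    title_tokens = record.get("title", "").lower().split()
    abstract_tokens = record.get("abstract", "").lower().split()

    return {
        "title_words": set(title_tokens),
        "abstract_words": set(abstract_tokens),
        "journal": record.get("journal", ""),
        "mesh_headings": set(record.get("mesh_terms", [])),
        # Reviews are often single-authored; clinical trials usually are not.
        "n_authors": len(record.get("authors", [])),
    }

example = {
    "title": "A randomized trial of drug X in condition Y",
    "abstract": "We conducted a double-blind randomized controlled trial ...",
    "journal": "Journal of Biomedical Informatics",
    "mesh_terms": ["Randomized Controlled Trials as Topic"],
    "authors": ["Smith J", "Lee K", "Garcia M"],
}
print(build_feature_set(example)["n_authors"])  # -> 3
```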
“…Rather than distributing the code and back-end databases to users, both of which are quite large and complex, and cumbersome to distribute and get running at another site, it is much more efficient to simply provide users with the end results. Indeed, our laboratory has created a suite of such pre-computed resources that are freely available online for viewing or download (http://arrowsmith.psych.uic.edu) 10–16.…”
Section: Introduction
confidence: 99%
“…In addition to LSA, there have also been several recent advances in conceptual analysis options, perhaps the most notable being Google's word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). Though powerful and fairly easy to implement with specialized packages (e.g., the Gensim library; Rehurek & Sojka, 2010), these new methods still suffer in part from a crucial drawback shared with LSA, in that the embeddings used to assess semantic similarity are high-dimensional mathematical spaces whose intrinsic meaning can be challenging to apprehend (Smalheiser & Bonifield, 2018). Though there has been research into techniques that attempt to address this issue (e.g., Luo, Liu, Luan, & Sun, 2015; Park, Bak, & Oh, 2017), generally these approaches make both the interpretation of the dimensions of the semantic space and understanding of the influence of specific keywords difficult.…”
Section: Quantifying Semantic Content
confidence: 99%
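Since the passage above names Gensim's word2vec implementation explicitly, a minimal sketch of that workflow may help make the interpretability point concrete. The toy corpus and all parameter values below are illustrative assumptions only; the parameter name vector_size follows Gensim 4.x (older releases used size).

```python
# Minimal word2vec sketch with the Gensim library (Rehurek & Sojka, 2010).
# The corpus is fabricated for illustration.
from gensim.models import Word2Vec

corpus = [
    ["aspirin", "reduces", "fever", "and", "pain"],
    ["ibuprofen", "reduces", "pain", "and", "inflammation"],
    ["embeddings", "map", "words", "to", "dense", "vectors"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, epochs=50)

# Cosine similarity between two terms is easy to read off ...
print(model.wv.similarity("aspirin", "ibuprofen"))

# ... but the 50 coordinates of any single vector carry no individually
# interpretable meaning -- the drawback noted in the quoted passage.
print(model.wv["aspirin"][:5])
```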
“…Though there has been research into techniques that attempt to address this issue (e.g., Luo, Liu, Luan, & Sun, 2015; Park, Bak, & Oh, 2017), generally these approaches make both the interpretation of the dimensions of the semantic space and understanding of the influence of specific keywords difficult. Further, though some of the simplicity of using word2vec comes from using pretrained embeddings, these spaces may not be optimal for particular applications, and training new embeddings can present several challenges (Smalheiser & Bonifield, 2018). CRA offers an enticing alternative that addresses these concerns.…”
Section: Quantifying Semantic Content
confidence: 99%