2019
DOI: 10.1016/j.jbi.2019.103096

Unsupervised low-dimensional vector representations for words, phrases and text that are transparent, scalable, and produce similarity metrics that are not redundant with neural embeddings

Abstract: Neural embeddings are a popular set of methods for representing words, phrases or text as a low dimensional vector (typically 50-500 dimensions). However, it is difficult to interpret these dimensions in a meaningful manner, and creating neural embeddings requires extensive training and tuning of multiple parameters and hyperparameters. We present here a simple unsupervised method for representing words, phrases or text as a low dimensional vector, in which the meaning and relative importance of dimensions is …

Cited by 22 publications (14 citation statements)
References 34 publications
“…With the growing amount of biomedical information available in textual form, there have been significant advances in the development of pretrained language representations that can be applied to a range of different tasks in the biomedical domain, such as pre-trained word embeddings, sentence embeddings, and contextual representations (Chiu et al., 2016; Peters et al., 2017; Smalheiser et al., 2019).…”
Section: Introduction
confidence: 99%
“…Next, for each article in PubMed, we assign a feature set that includes metadata features extracted from the PubMed XML record (or computed from information contained in the record) that we know (or suspect) may provide information that will help in assigning PTs. The feature set includes a variety of textual features – for example, words that appear in the title and/or in the abstract, as well as low-dimensional vector representations of these words (e.g., implicit term metrics 16 or word2vec neural embeddings 16,17 ). The feature set also includes journal name (since publication types are not distributed equally across journals), Medical Subject Headings, and other features such as the number of authors listed on the article (note that reviews are often single-authored whereas clinical trials generally have many author names on each paper).…”
Section: Results
confidence: 99%
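The feature set described in the statement above (title and abstract words, journal name, MeSH headings, author count) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions: each PubMed record is assumed to be already parsed into a plain dict, and the field names (title, abstract, journal, mesh_terms, authors) and the build_feature_set helper are hypothetical, not the authors' actual pipeline.

```python
# Illustrative sketch only: field names and structure are assumptions,
# not the cited authors' actual PubMed processing code.

def build_feature_set(record):
    """Assemble a simple feature dict for one already-parsed PubMed record."""
    title_tokens = record.get("title", "").lower().split()
    abstract_tokens = record.get("abstract", "").lower().split()

    return {
        "title_words": set(title_tokens),
        "abstract_words": set(abstract_tokens),
        "journal": record.get("journal", ""),
        "mesh_headings": set(record.get("mesh_terms", [])),
        # Reviews are often single-authored; clinical trials usually are not.
        "n_authors": len(record.get("authors", [])),
    }

example = {
    "title": "A randomized trial of drug X in condition Y",
    "abstract": "We conducted a double-blind randomized controlled trial ...",
    "journal": "Journal of Biomedical Informatics",
    "mesh_terms": ["Randomized Controlled Trials as Topic"],
    "authors": ["Smith J", "Lee K", "Garcia M"],
}
print(build_feature_set(example)["n_authors"])  # -> 3
```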
“…Rather than distributing the code and back-end databases to users, both of which are quite large and complex, and cumbersome to distribute and get running at another site, it is much more efficient to simply provide users with the end results. Indeed, our laboratory has created a suite of such pre-computed resources that are freely available online for viewing or download (http://arrowsmith.psych.uic.edu) 10–16.…”
Section: Introduction
confidence: 99%
“…In addition to LSA, there have also been several recent advances in conceptual analysis options, perhaps the most notable being Google's word2vec (Mikolov, Sutskever, Chen, Corrado, & Dean, 2013). Though powerful and fairly easy to implement with specialized packages (e.g., the Gensim library; Rehurek & Sojka, 2010), these new methods still suffer in part from a crucial drawback shared with LSA, in that the embeddings used to assess semantic similarity are high-dimensional mathematical spaces whose intrinsic meaning can be challenging to apprehend (Smalheiser & Bonifield, 2018). Though there has been research into techniques that attempt to address this issue (e.g., Luo, Liu, Luan, & Sun, 2015; Park, Bak, & Oh, 2017), generally these approaches make both the interpretation of the dimensions of the semantic space and understanding of the influence of specific keywords difficult.…”
Section: Quantifying Semantic Content
confidence: 99%
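Since the passage above names Gensim's word2vec implementation explicitly, a minimal sketch of that workflow may help make the interpretability point concrete. The toy corpus and all parameter values below are illustrative assumptions only; the parameter name vector_size follows Gensim 4.x (older releases used size).

```python
# Minimal word2vec sketch with the Gensim library (Rehurek & Sojka, 2010).
# The corpus is fabricated for illustration.
from gensim.models import Word2Vec

corpus = [
    ["aspirin", "reduces", "fever", "and", "pain"],
    ["ibuprofen", "reduces", "pain", "and", "inflammation"],
    ["embeddings", "map", "words", "to", "dense", "vectors"],
]

model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, epochs=50)

# Cosine similarity between two terms is easy to read off ...
print(model.wv.similarity("aspirin", "ibuprofen"))

# ... but the 50 coordinates of any single vector carry no individually
# interpretable meaning -- the drawback noted in the quoted passage.
print(model.wv["aspirin"][:5])
```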
“…Though there has been research into techniques that attempt to address this issue (e.g., Luo, Liu, Luan, & Sun, 2015; Park, Bak, & Oh, 2017), generally these approaches make both the interpretation of the dimensions of the semantic space and understanding of the influence of specific keywords difficult. Further, though some of the simplicity of using word2vec comes from using pretrained embeddings, these spaces may not be optimal for particular applications, and training new embeddings can present several challenges (Smalheiser & Bonifield, 2018). CRA offers an enticing alternative that addresses these concerns.…”
Section: Quantifying Semantic Content
confidence: 99%