Background Neural network-based embedding models are receiving significant attention in the field of natural language processing due to their capability to effectively capture semantic information representing words, sentences or even larger text elements in low-dimensional vector space. While current state-of-the-art models for assessing the semantic similarity of textual statements from biomedical publications depend on the availability of laboriously curated ontologies, unsupervised neural embedding models only require large text corpora as input and do not need manual curation. In this study, we investigated the efficacy of current state-of-the-art neural sentence embedding models for semantic similarity estimation of sentences from biomedical literature. We trained different neural embedding models on 1.7 million articles from the PubMed Open Access dataset, and evaluated them based on a biomedical benchmark set containing 100 sentence pairs annotated by human experts and a smaller contradiction subset derived from the original benchmark set.
Results Experimental results showed that, with a Pearson correlation of 0.819, our best unsupervised model based on the Paragraph Vector Distributed Memory algorithm outperforms previous state-of-the-art results achieved on the BIOSSES biomedical benchmark set. Moreover, our proposed supervised model that combines different string-based similarity metrics with a neural embedding model surpasses previous ontology-dependent supervised state-of-the-art approaches in terms of Pearson’s r (r = 0.871) on the biomedical benchmark set. In contrast to the promising results for the original benchmark, we found our best models’ performance on the smaller contradiction subset to be poor.
Conclusions In this study, we have highlighted the value of neural network-based models for semantic similarity estimation in the biomedical domain by showing that they can keep up with and even surpass previous state-of-the-art approaches for semantic similarity estimation that depend on the availability of laboriously curated ontologies, when evaluated on a biomedical benchmark set. Capturing contradictions and negations in biomedical sentences, however, emerged as an essential area for further work.
Electronic supplementary material The online version of this article (10.1186/s12859-019-2789-2) contains supplementary material, which is available to authorized users.
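The evaluation described in this abstract (scoring sentence pairs with an embedding model and comparing those scores with expert annotations via Pearson's r) can be illustrated with a minimal sketch. It assumes that pair similarity is taken as the cosine of the two sentence vectors, which the abstract does not state explicitly. The three-dimensional vectors and the human scores below are made-up toy values, not data from the study, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Toy stand-ins for sentence embeddings (assumption: real models would
# produce vectors with hundreds of dimensions).
sentence_pairs = [
    ([1.0, 0.0, 0.0], [1.0, 0.1, 0.0]),  # nearly identical meaning
    ([1.0, 1.0, 0.0], [1.0, 0.0, 1.0]),  # partially related
    ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0]),  # unrelated
]

# Toy expert similarity ratings for the same pairs (illustrative only).
human_scores = [4.0, 2.0, 0.0]

model_scores = [cosine_similarity(u, v) for u, v in sentence_pairs]
r = pearson_r(model_scores, human_scores)
print(f"model scores: {[round(s, 3) for s in model_scores]}")
print(f"Pearson r vs. human scores: {r:.3f}")
```

In a real evaluation, the vectors would come from a trained embedding model, and the benchmark would supply the 100 annotated sentence pairs.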
(1) Background: Cardiac amyloidosis (CA) is a rare and complex condition with poor prognosis. While novel therapies improve outcomes, many affected individuals remain undiagnosed due to a lack of awareness among clinicians. This study was undertaken to develop an expert-independent machine learning (ML) prediction model for CA relying on routinely determined laboratory parameters. (2) Methods: In a first step, we developed baseline linear models based on logistic regression. In a second step, we used an ML algorithm based on gradient tree boosting to improve our linear prediction model, and to perform non-linear prediction. Then, we compared the performance of all diagnostic algorithms. All prediction models were developed on a training cohort, consisting of patients with proven CA (positive cases, n = 121) and amyloidosis-unrelated heart failure (HF) patients (negative cases, n = 415). Performances of all prediction models were evaluated on a separate prognostic validation cohort with 37 CA-positive and 124 CA-negative patients. (3) Results: Our best model, based on gradient-boosted ensembles of decision trees, achieved an area under the receiver operating characteristic curve (ROC AUC) score of 0.86, with sensitivity and specificity of 89.2% and 78.2%, respectively. The best linear model had an ROC AUC score of 0.75, with sensitivity and specificity of 84.6% and 71.7%, respectively. (4) Conclusions: Our work demonstrates that ML makes it possible to utilize basic laboratory parameters to generate a distinct CA-related HF profile compared with CA-unrelated HF patients. This proof-of-concept study opens a potential new avenue in the diagnostic workup of CA and may assist physicians in clinical reasoning.
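The sensitivity and specificity reported for the best model can be related to the validation cohort sizes with a short worked example. The abstract gives only the cohort sizes (37 CA-positive, 124 CA-negative) and the percentages. The true-positive and true-negative counts used below (33 and 97) are an inference, chosen because they are the only whole numbers that reproduce 89.2% and 78.2% for cohorts of those sizes. They are not stated in the source, and the function names are hypothetical.

```python
def sensitivity(true_pos, false_neg):
    """Fraction of actual positives that the model flags as positive."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg, false_pos):
    """Fraction of actual negatives that the model flags as negative."""
    return true_neg / (true_neg + false_pos)


# Validation cohort sizes as given in the abstract.
n_positive = 37
n_negative = 124

# ASSUMPTION: counts inferred from the reported percentages, not reported
# directly in the abstract.
true_pos = 33
true_neg = 97
false_neg = n_positive - true_pos
false_pos = n_negative - true_neg

sens = sensitivity(true_pos, false_neg)
spec = specificity(true_neg, false_pos)
print(f"sensitivity: {100 * sens:.1f}%")
print(f"specificity: {100 * spec:.1f}%")
```

Under these inferred counts, the model would miss 4 of the 37 CA-positive patients and wrongly flag 27 of the 124 CA-negative patients at the chosen decision threshold.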