Objective This article methodically reviews the literature on deep learning (DL) for natural language processing (NLP) in the clinical domain, providing quantitative analysis to answer 3 research questions concerning methods, scope, and context of current research. Materials and Methods We searched MEDLINE, EMBASE, Scopus, the Association for Computing Machinery Digital Library, and the Association for Computational Linguistics Anthology for articles using DL-based approaches to NLP problems in electronic health records. After screening 1,737 articles, we collected data on 25 variables across 212 papers. Results DL in clinical NLP publications more than doubled each year, through 2018. Recurrent neural networks (60.8%) and word2vec embeddings (74.1%) were the most popular methods; the information extraction tasks of text classification, named entity recognition, and relation extraction were dominant (89.2%). However, there was a “long tail” of other methods and specific tasks. Most contributions were methodological variants or applications, but 20.8% were new methods of some kind. The earliest adopters were in the NLP community, but the medical informatics community was the most prolific. Discussion Our analysis shows growing acceptance of deep learning as a baseline for NLP research, and of DL-based NLP in the medical community. A number of common associations were substantiated (eg, the preference of recurrent neural networks for sequence-labeling named entity recognition), while others were surprisingly nuanced (eg, the scarcity of French language clinical NLP with deep learning). Conclusion Deep learning has not yet fully penetrated clinical NLP and is growing rapidly. This review highlighted both the popular and unique trends in this active field.
Glioblastoma is a highly invasive brain tumor, whose cells infiltrate surrounding normal brain tissue beyond the lesion outlines visible in the current medical scans. These infiltrative cells are treated mainly by radiotherapy. Existing radiotherapy plans for brain tumors derive from population studies and scarcely account for patient-specific conditions. Here we provide a Bayesian machine learning framework for the rational design of improved, personalized radiotherapy plans using mathematical modeling and patient multimodal medical scans. Our method, for the first time, integrates complementary information from high resolution MRI scans and highly specific FET-PET metabolic maps to infer tumor cell density in glioblastoma patients. The Bayesian framework quantifies imaging and modeling uncertainties and predicts patient-specific tumor cell density with credible intervals. The proposed methodology relies only on data acquired at a single time point and thus is applicable to standard clinical settings. An initial clinical population study shows that the radiotherapy plans generated from the inferred tumor cell infiltration maps spare more healthy tissue thereby reducing radiation toxicity while yielding comparable accuracy with standard radiotherapy protocols. Moreover, the inferred regions of high tumor cell densities coincide with the tumor radioresistant areas, providing guidance for personalized doseescalation. The proposed integration of multimodal scans and mathematical modeling provides a robust, non-invasive tool to assist personalized radiotherapy design.
Understanding the role of a given transcription factor (TF) in regulating gene expression requires precise mapping of its binding sites in the genome. Chromatin immunoprecipitation-exo, an emerging technique using λ exonuclease to digest TF unbound DNA after ChIP, is designed to reveal transcription factor binding site (TFBS) boundaries with near-single nucleotide resolution. Although ChIP-exo promises deeper insights into transcription regulation, no dedicated bioinformatics tool exists to leverage its advantages. Most ChIP-seq and ChIP-chip analytic methods are not tailored for ChIP-exo, and thus cannot take full advantage of high-resolution ChIP-exo data. Here we describe a novel analysis framework, termed MACE (model-based analysis of ChIP-exo) dedicated to ChIP-exo data analysis. The MACE workflow consists of four steps: (i) sequencing data normalization and bias correction; (ii) signal consolidation and noise reduction; (iii) single-nucleotide resolution border peak detection using the Chebyshev Inequality and (iv) border matching using the Gale-Shapley stable matching algorithm. When applied to published human CTCF, yeast Reb1 and our own mouse ONECUT1/HNF6 ChIP-exo data, MACE is able to define TFBSs with high sensitivity, specificity and spatial resolution, as evidenced by multiple criteria including motif enrichment, sequence conservation, direct sequence pileup, nucleosome positioning and open chromatin states. In addition, we show that the fundamental advance of MACE is the identification of two boundaries of a TFBS with high resolution, whereas other methods only report a single location of the same event. The two boundaries help elucidate the in vivo binding structure of a given TF, e.g. whether the TF may bind as dimers or in a complex with other co-factors.
The theory and application of compressive sensing (CS) have received a lot of interest in recent years. The basic idea in CS is to use a specially-designed sensor to sample signals that are sparse in some basis (e.g. wavelet basis) directly in a compressed form, and then to reconstruct (decompress) these signals accurately using some inversion algorithm after transmission to a central processing unit. However, many signals in reality are only approximately sparse, where only a relatively small number of the signal coefficients in some basis are significant and the remaining basis coefficients are relatively small but they are not all zero. In this case, perfect reconstruction from compressed measurements is not expected. In this paper, a Bayesian CS algorithm is proposed for the first time to reconstruct approximately sparse signals. A robust treatment of the uncertain parameters is explored, including integration over the prediction-error precision parameter to remove it as a "nuisance" parameter, and introduction of a successive relaxation procedure for the required optimization of the basis coefficient hyper-parameters. The performance of the algorithm is investigated using compressed data from synthetic signals and real signals from structural health monitoring systems installed on a space-frame structure and on a cable-stayed bridge. Compared with other state-of-the-art CS methods, including our previously-published Bayesian method, the new CS algorithm shows superior performance in reconstruction robustness and posterior uncertainty quantification, for approximately sparse signals. Furthermore, our method can be utilized for recovery of lost data during wireless transmission, even if the level of sparseness in the signal is low.
A review of published work in clinical natural language processing (NLP) may suggest that the negation detection task has been “solved.” This work proposes that an optimizable solution does not equal a generalizable solution. We introduce a new machine learning-based Polarity Module for detecting negation in clinical text, and extensively compare its performance across domains. Using four manually annotated corpora of clinical text, we show that negation detection performance suffers when there is no in-domain development (for manual methods) or training data (for machine learning-based methods). Various factors (e.g., annotation guidelines, named entity characteristics, the amount of data, and lexical and syntactic context) play a role in making generalizability difficult, but none completely explains the phenomenon. Furthermore, generalizability remains challenging because it is unclear whether to use a single source for accurate data, combine all sources into a single model, or apply domain adaptation methods. The most reliable means to improve negation detection is to manually annotate in-domain training data (or, perhaps, manually modify rules); this is a strategy for optimizing performance, rather than generalizing it. These results suggest a direction for future work in domain-adaptive and task-adaptive methods for clinical NLP.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.