Although machine learning has become a powerful tool for augmenting doctors in clinical analysis, the large amount of labeled data needed to train supervised models makes each development task time- and resource-intensive. The vast majority of detailed clinical information is stored in written reports describing pertinent patient information. Natural language data is difficult to use for standard model development because the modality is complex and unstructured. In this research, a model pipeline was developed that uses an unsupervised approach to train an encoder language model, a bidirectional recurrent neural network, to generate document encodings. These encodings are then passed as features to a decoder classifier model that requires orders of magnitude less labeled data than previous approaches to accurately differentiate between fine-grained disease classes. The language model was trained on unlabeled radiology reports from the Massachusetts General Hospital Radiology Department (n=218,159); training terminated at a loss of 1.62 and a word prediction accuracy of 62%. The classification models were trained on three labeled datasets of head CT studies, for large vessel occlusion (n=1403), acute ischemic stroke (n=331), and intracranial hemorrhage (n=4350), to identify a variety of findings directly from the radiology report text. The resulting AUCs were 0.98, 0.95, and 0.99, respectively. The output encodings can be used in conjunction with imaging data to create models that process multiple modalities. The ability to automatically extract relevant features from textual data allows for faster model development and integration of
Preprint. Under review.
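The AUC values quoted in this abstract summarize how well a classifier ranks positive reports above negative ones. As a minimal sketch of that metric only (not the authors' evaluation code), AUC can be computed as the fraction of positive-negative pairs that are ordered correctly, with ties counted as half. The function name `roc_auc` and the labels and scores below are made up for illustration.

```python
def roc_auc(labels, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is O(n_pos * n_neg), which is fine for a small illustration.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = finding present) and classifier scores.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(labels, scores)
print(f"AUC = {auc:.3f}")
```

In this toy example, 11 of the 12 positive-negative pairs are ranked correctly, so the AUC is 11/12.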
With the availability of large observational datasets, an empirical paradigm to screen for drug repurposing opportunities (i.e., beneficial effects of drugs on nonindicated outcomes) is feasible. In this article, we use a linked claims and electronic health record database to comprehensively explore repurposing effects of antihypertensive drugs. We follow a target trial emulation framework for causal inference to emulate randomized controlled trials, estimating confounding-adjusted effects of antihypertensives on each of 262 outcomes of interest. We then fit hierarchical models to the results as a form of postprocessing, both to account for multiple comparisons and to sift through the results in a principled way. Our motivation is twofold. We seek both to surface genuinely intriguing drug repurposing opportunities and to elucidate, through a real application, some study design decisions and potential biases that arise in this context.
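To illustrate the general idea of using a hierarchical model to temper many effect estimates at once, the sketch below applies a simple normal-normal partial-pooling model with a method-of-moments estimate of the between-outcome variance. This is an assumption made for illustration, not the model the article fits; the function name `shrink_estimates` and the effect estimates and standard errors are hypothetical.

```python
def shrink_estimates(estimates, std_errors):
    """Shrink per-outcome effect estimates toward their grand mean.

    Assumed model (illustrative only): true effects ~ Normal(mu, tau2) and
    each estimate ~ Normal(true effect, se**2). mu is taken as the unweighted
    mean, and tau2 is a crude method-of-moments estimate floored at zero.
    """
    k = len(estimates)
    if k < 2 or k != len(std_errors):
        raise ValueError("need at least two estimates with matching standard errors")
    grand_mean = sum(estimates) / k
    # Observed spread of estimates = true between-outcome spread + sampling noise.
    total_var = sum((e - grand_mean) ** 2 for e in estimates) / (k - 1)
    mean_se2 = sum(se ** 2 for se in std_errors) / k
    tau2 = max(0.0, total_var - mean_se2)
    shrunk = []
    for e, se in zip(estimates, std_errors):
        # Weight near 1 keeps a precise estimate; weight near 0 pulls it to the mean.
        weight = tau2 / (tau2 + se ** 2) if tau2 > 0 else 0.0
        shrunk.append(grand_mean + weight * (e - grand_mean))
    return shrunk, tau2


# Hypothetical log hazard ratios and standard errors for five outcomes.
estimates = [-0.40, 0.05, 0.10, -0.02, 0.30]
std_errors = [0.20, 0.05, 0.15, 0.04, 0.25]
shrunk, tau2 = shrink_estimates(estimates, std_errors)
for raw, post in zip(estimates, shrunk):
    print(f"raw={raw:+.2f}  shrunk={post:+.3f}")
```

Noisy estimates (large standard errors) are pulled strongly toward the overall mean, while precise ones barely move, which is how partial pooling dampens spurious extreme results when many outcomes are screened together.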