Objective: Accurate and rapid phenotyping is a prerequisite to leveraging electronic health records for biomedical research. While early phenotyping relied on rule-based algorithms curated by experts, machine learning (ML) approaches have emerged as an alternative to improve scalability across phenotypes and healthcare settings. This study evaluates ML-based phenotyping with respect to (1) the data sources used, (2) the phenotypes considered, (3) the methods applied, and (4) the reporting and evaluation methods used. Materials and methods: We searched PubMed and Web of Science for articles published between 2018 and 2022. After screening 850 articles, we recorded 37 variables on 100 studies. Results: Most studies utilized data from a single institution and included information in clinical notes. Although chronic conditions were most commonly considered, ML also enabled the characterization of nuanced phenotypes such as social determinants of health. Supervised deep learning was the most popular ML paradigm, while semi-supervised and weakly supervised learning were applied to expedite algorithm development and unsupervised learning to facilitate phenotype discovery. ML approaches did not uniformly outperform rule-based algorithms, but deep learning offered a marginal improvement over traditional ML for many conditions. Discussion: Despite the progress in ML-based phenotyping, most articles focused on binary phenotypes and few articles evaluated external validity or used multi-institution data. Study settings were infrequently reported and analytic code was rarely released. Conclusion: Continued research in ML-based phenotyping is warranted, with emphasis on characterizing nuanced phenotypes, establishing reporting and evaluation standards, and developing methods to accommodate misclassified phenotypes due to algorithm errors in downstream applications.
Objective: Accurate and rapid methods for phenotyping are a prerequisite to realizing the potential of electronic health record (EHR) data for clinical and translational research. This study reviews the literature on machine learning (ML) approaches for phenotyping with respect to the phenotypes considered, the data sources and methods used, and the contributions within the wider context of EHR-based research. Materials and Methods: We searched for relevant articles in PubMed and Web of Science published between January 1, 2018 and April 14, 2022. After screening, we collected data on 52 variables across 106 selected articles. Results: ML-based methods were developed for 156 unique phenotypes, primarily using EHR data from a single institution or health system. Seventy-two of the 106 articles leveraged unstructured data in clinical notes. In terms of methodology, supervised learning was the most prevalent ML paradigm (n = 64, 60.4%), with half of the articles employing deep learning. Semi-supervised and weakly supervised approaches were applied to reduce the burden of obtaining gold-standard labeled data (n = 21, 19.8%), while unsupervised learning was used for phenotype discovery (n = 20, 18.9%). Federated learning was applied to develop algorithms across multiple institutions while preserving data privacy (n = 2, 1.9%). Discussion: While the use of ML for phenotyping is growing, most articles applied traditional supervised ML to characterize the presence of common, chronic conditions. Conclusion: Continued research in ML-based methods is warranted, with particular attention to the development of advanced methods for complex phenotypes and standards for reporting and evaluating phenotyping algorithms.
The general problem of constructing regions that have a guaranteed coverage probability for an arbitrary parameter of interest $\psi \in \Psi$ is considered. The regions developed are Bayesian in nature and the coverage probabilities can be considered as Bayesian confidences with respect to the model obtained by integrating out the nuisance parameters using the conditional prior given $\psi$. Both the prior coverage probability and the prior probability of covering a false value (the accuracy) can be controlled by setting the sample size. These coverage probabilities are considered as a priori figures of merit concerning the reliability of a study, while the inferences quoted are Bayesian. Several problems are considered where obtaining confidence regions with desirable properties has proven difficult. For example, it is shown that the approach discussed never leads to improper regions, an issue that has arisen for some confidence regions.