Generative transfer learning for measuring plausibility of EHR diagnosis records

Estiri, Hossein; Vasey, Sebastien; Murphy, Shawn N.

doi:10.1093/jamia/ocaa215

Cited by 14 publications

(15 citation statements)

References 28 publications

“…We followed the same analytic process used by Estiri et al (2021) 30 that was used to identify risk factors for COVID-19 mortality from EHR data. From the MLHO framework, the computational process to conduct multivariate PheWAS involved applying the Minimize Sparsity, Maximize Relevance (MSMR) 23,31,32 algorithm, clinical expertise, and multinomial generalized linear modeling (GLM) with component-wise functional gradient boosting, and a composite confidence score to identify the phenotypes that are positively associated with a past PCR test (see eMethods).…”

Section: Methods (mentioning)

confidence: 99%

“…We followed a similar analytic process used by [31] that was used to identify risk factors for COVID-19 mortality from EHR data. From the MLHO framework, the computational process involved applying the Minimize Sparsity, Maximize Relevance (MSMR) algorithm, [23,32,33] clinical expertise, and multivariate boosting logistic regression, to compute a composite confidence score for identifying the phenotypes that are positively associated with a past RT-PCR test (see eMethods for more details).…”

Section: MLHO Framework (mentioning)

confidence: 99%

“…Due to the known reliability issues of EHR diagnosis records, [33,34] we validated the phenotypes identified by MLHO through chart reviews. A clinical expert reviewed the clinical notes and longitudinal records for a random sample of five patients for each phenotype identified by MLHO with an 80-plus confidence score.…”

Section: Clinical Validation Via Chart Reviews (mentioning)

confidence: 99%

Evolving Phenotypes of non-hospitalized Patients that Indicate Long Covid

Estiri, Strasser, Brat, et al. 2021

Preprint

Self Cite

Many of the symptoms characterized as the post-acute sequelae of SARS-CoV-2 infection (PASC) could have multiple causes or are similarly seen in non-COVID patients. Accurate identification of phenotypes will be important to guide future research and the healthcare system to focus its efforts and resources on adequately controlled age- and gender-specific sequelae of COVID-19 infection. In this retrospective electronic health records (EHR) cohort study, we applied a computational framework for knowledge discovery from clinical data, MLHO, to identify phenotypes that positively associate with a past positive PCR test for COVID-19. We evaluated the post-test phenotypes in two temporal windows, at 3-6 and 6-9 months after the test, and by age and gender. We utilized longitudinal diagnosis records stored in EHRs from Mass General Brigham (MGB) for 57 thousand patients who tested positive or negative for COVID-19 and were not hospitalized. Statistical analyses were performed on data from March 2020 to March 2021. We analyzed PCR test results and subsequent diagnosis records that were recorded for the first time two months or later after the PCR test. We identified 28 phenotypes among different age/gender cohorts or time windows that were positively associated with a past SARS-CoV-2 infection. All identified phenotypes were newly recorded in patients’ medical records two months or longer after a COVID-19 PCR test in non-hospitalized patients regardless of the test result. Among these phenotypes, a new diagnosis record for anosmia and dysgeusia (OR 2.17, 95% CI [1.42 - 3.25]), alopecia (OR 3.54, 95% CI [2.92 - 4.3]), chest pain (OR 1.35, 95% CI [1.16 - 1.56]), or chronic fatigue syndrome (OR 1.81-2.28, 95% CI [1.38 - 3.68]) are the most significant indicators of a past COVID-19 infection, especially among women younger than 65.
Among men, edema (OR 1.83, 95% CI [1.23 - 2.66]) and disease of nail (OR 3.54, 95% CI [1.63 - 7.29]) in patients 65 and older, or proteinuria (OR 2.66, 95% CI [1.61 - 4.34]) in patients under 65, are associated with a positive COVID-19 PCR test in the past few months. Our approach avoids a flood of false positive discoveries while offering a more flexible, probabilistic criterion than the standard linear phenome-wide association study (PheWAS). These findings suggest that some of the previously identified post sequelae of COVID-19 may not be accurate and that most of the PASC are observed in patients under 65 years of age.

“…Among the methods used to train word embeddings, Word2vec [98] and Bidirectional Encoder Representations from Transformers (BERT) and variants [99][100][101][102] are the most frequently used (Supplementary Material Table S9). Word embeddings typically serve as the input layer to phenotyping algorithms using deep learning models and have also been used to account for ambiguous abbreviations and spelling errors in clinical notes [25,27,29,32,34,42,69,72,[74][75][76][77]83,[93][94][95][96][97][103][104][105][106][107][108][109][110][111][112][113][114].…”

Section: Data Types (mentioning)

confidence: 99%

“…Estiri et al utilized a self-learning approach to develop standard generative models (eg. Naive Bayes, Linear Discriminant Analysis) using a small set of labeled data (average 182 patients) and larger set of unlabeled (average 5956 patients) for classification of 18 phenotypes [114]. The approach performed on par with supervised learning, but required less labeled data (AUROC 0.78-0.99).…”

Section: Semi-supervised Learning (mentioning)

confidence: 99%

Machine Learning Approaches for Electronic Health Records Phenotyping: A Methodical Review

Yang, Varghese, Stephenson, et al. 2022

Preprint

Objective: Accurate and rapid methods for phenotyping are a prerequisite to realizing the potential of electronic health records (EHRs) data for clinical and translational research. This study reviews the literature on machine learning (ML) approaches for phenotyping with respect to the phenotypes considered, the data sources and methods used, and the contributions within the wider context of EHR-based research. Materials and Methods: We searched for relevant articles in PubMed and Web of Science published between January 1, 2018 and April 14, 2022. After screening, we collected data on 52 variables across 106 selected articles. Results: ML-based methods were developed for 156 unique phenotypes, primarily using EHR data from a single institution or health system. 72 of 106 articles leveraged unstructured data in clinical notes. In terms of methodology, supervised learning is the most prevalent ML paradigm (n = 64, 60.4%), with half of the articles employing deep learning. Semi-supervised and weakly-supervised approaches were applied to reduce the burden of obtaining gold-standard labeled data (n = 21, 19.8%), while unsupervised learning was used for phenotype discovery (n = 20, 18.9%). Federated learning has been applied to develop algorithms across multiple institutions while preserving data privacy (n = 2, 1.9%). Discussion: While the use of ML for phenotyping is growing, most articles applied traditional supervised ML to characterize the presence of common, chronic conditions. Conclusion: Continued research in ML-based methods is warranted, with particular attention to the development of advanced methods for complex phenotypes and standards for reporting and evaluating phenotyping algorithms.
