Background: Electronic health records (EHRs) are a large data source for outcomes research, but the majority of EHR data are unstructured (e.g., the free text of clinical notes) and not conducive to computational methods. Existing approaches to handling unstructured data, such as manual abstraction, structured proxy variables, and model-assisted abstraction, are time-consuming, not scalable, and require clinical domain expertise. This paper aims to determine whether selective prediction, which gives a model the option to abstain from generating a prediction, can improve the accuracy and efficiency of unstructured clinical data abstraction.

Methods: We trained selective prediction models to identify the presence of four distinct clinical variables in free-text pathology reports: primary cancer diagnosis of glioblastoma (GBM, n = 659), resection of rectal adenocarcinoma (RRA, n = 601), and two procedures for RRA: abdominoperineal resection (APR, n = 601) and low anterior resection (LAR, n = 601). Data were manually abstracted from pathology reports and used to train L1-regularized logistic regression models on term frequency-inverse document frequency (TF-IDF) features. Data points that the model could not predict with high certainty were manually abstracted.

Findings: All four selective prediction models achieved a test-set sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) above 0.91. Selective prediction led to sizable gains in automation (a 57% to 95% reduction in manual chart abstraction across the four outcomes). For the GBM classifier, the selective prediction model improved sensitivity (0.94 to 0.96), specificity (0.79 to 0.96), PPV (0.89 to 0.98), and NPV (0.88 to 0.91) relative to a non-selective classifier.

Interpretation: Selective prediction using utility-based probability thresholds can facilitate unstructured data extraction by routing "easy" charts to a model and "hard" charts to human abstractors, thus increasing efficiency while maintaining or improving accuracy.
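The workflow described above maps onto a short script. Below is a minimal sketch, assuming scikit-learn for the TF-IDF features and the L1-regularized logistic regression; the toy reports, labels, and the threshold values T_LOW and T_HIGH are hypothetical placeholders, not the paper's utility-derived thresholds.

```python
# Minimal sketch of a selective-prediction abstraction workflow.
# Assumes scikit-learn; reports, labels, and thresholds are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

reports = ["final diagnosis: glioblastoma, who grade iv",
           "low grade glioma, no evidence of gbm",
           "glioblastoma multiforme identified in specimen",
           "benign meningioma, no malignancy seen"]
labels = np.array([1, 0, 1, 0])  # manually abstracted gold labels

# TF-IDF features + L1-regularized logistic regression
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reports)
clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
clf.fit(X, labels)

# Selective prediction: abstain when the predicted probability falls
# between the two thresholds; abstained charts go to human abstractors.
T_LOW, T_HIGH = 0.2, 0.8  # hypothetical thresholds, not the paper's values
probs = clf.predict_proba(vectorizer.transform(reports))[:, 1]
for report, p in zip(reports, probs):
    if p >= T_HIGH:
        decision = "predict positive"
    elif p <= T_LOW:
        decision = "predict negative"
    else:
        decision = "abstain -> manual abstraction"
    print(f"p={p:.2f}  {decision}  ({report[:30]}...)")
```

Charts routed to "abstain" are exactly the "hard" cases the interpretation refers to; tightening or widening the threshold band trades automation rate against accuracy.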
Background: Data on lines of therapy (LOTs) for cancer treatment are important for clinical oncology research, but LOTs are not explicitly recorded in EHRs. We present an efficient approach to clinical data abstraction and a flexible algorithm to derive LOTs from EHR-based medication data on patients with glioblastoma (GBM).

Methods: Non-clinicians were trained to abstract the diagnosis of GBM from EHRs, and their accuracy was compared to abstraction performed by clinicians. The resulting data were used to build a cohort of patients with a confirmed GBM diagnosis. An algorithm was developed to derive LOTs from structured medication data, accounting for the addition and discontinuation of therapies and for drug class. Descriptive statistics were calculated, and time-to-next-treatment analysis was performed using the Kaplan-Meier method.

Results: Treating clinicians as the gold standard, non-clinicians abstracted GBM diagnosis with sensitivity 0.98, specificity 1.00, PPV 1.00, and NPV 0.90, suggesting that non-clinician abstraction of GBM diagnosis was comparable to clinician abstraction. Of 693 patients with a confirmed diagnosis of GBM, 246 had structured information about the types of medications received. Of those, 165 (67.1%) received a first-line therapy (1L) of temozolomide, and the median time-to-next-treatment from the start of 1L was 179 days.

Conclusions: We developed a flexible, interpretable, and easy-to-implement algorithm that derives LOTs from EHR data on medication orders and administrations and can be used to create high-quality datasets for outcomes research. We also showed that the cost of chart abstraction can be reduced by training non-clinicians instead of clinicians.

Importance of the study: This study proposes an efficient and accurate method to extract unstructured data from electronic health records (EHRs) for cancer outcomes research. It addresses the limitations of manual abstraction of unstructured clinical data and presents a reproducible, low-cost workflow for clinical data abstraction together with a flexible algorithm to derive lines of therapy (LOTs) from EHR-based structured medication data. The LOT data were used to conduct a descriptive treatment-pattern analysis and a time-to-next-treatment analysis, demonstrating how EHR-derived data can be transformed to answer diverse clinical research questions. The study also investigates the feasibility of training non-clinicians to perform abstraction of GBM data, demonstrating that, with detailed explanations of clinical documentation, best practices for chart review, and quantitative evaluation of abstraction performance, data quality comparable to clinician abstraction can be achieved. These findings have important implications for improving cancer outcomes research and facilitating the analysis of EHR-derived treatment data.
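The abstract does not spell out the LOT rules, so the sketch below implements one plausible simplification: medications started within a fixed window of a line's start are treated as part of the same regimen, and any new drug started outside that window opens a new line. The 28-day window, the derive_lots helper, and the example records are all hypothetical; the paper's actual algorithm additionally accounts for drug class and discontinuation.

```python
# Simplified sketch of deriving lines of therapy (LOTs) from structured
# medication records. The regimen window and records are hypothetical.
from datetime import date

REGIMEN_WINDOW_DAYS = 28  # drugs started within this window join the same LOT

def derive_lots(med_records):
    """med_records: list of (start_date, drug_name) for one patient."""
    records = sorted(med_records)
    lots = []
    for start, drug in records:
        if lots and (drug in lots[-1]["drugs"]
                     or (start - lots[-1]["start"]).days <= REGIMEN_WINDOW_DAYS):
            lots[-1]["drugs"].add(drug)  # same line: continuation or add-on
        else:
            lots.append({"start": start, "drugs": {drug}})  # new line starts
    return lots

# Example: temozolomide as 1L, followed by bevacizumab as 2L
records = [(date(2020, 1, 5), "temozolomide"),
           (date(2020, 1, 20), "temozolomide"),
           (date(2020, 7, 2), "bevacizumab")]
for i, lot in enumerate(derive_lots(records), start=1):
    print(f"{i}L start={lot['start']}  drugs={sorted(lot['drugs'])}")
# Time-to-next-treatment for 1L = start of 2L minus start of 1L
# (179 days in this illustrative example, matching the reported median).
```

Per-patient LOT start dates derived this way can then feed a standard Kaplan-Meier time-to-next-treatment analysis, with patients who never start a next line treated as censored.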
OBJECTIVE: Machine learning (ML) has become an increasingly popular tool in neurosurgical research, and publications in the field have recently expanded in both quantity and complexity. This places a commensurate burden on the general neurosurgical readership to appraise this literature and decide whether these algorithms can be effectively translated into practice. To this end, the authors reviewed the burgeoning neurosurgical ML literature and developed a checklist to help readers critically review and digest this work.

METHODS: The authors searched the PubMed database for recent ML papers using the terms "neurosurgery" AND "machine learning," with the additional modifiers "trauma," "cancer," "pediatric," and "spine" to ensure a diverse selection of relevant papers within the field. Papers were reviewed for their ML methodology, including the formulation of the clinical problem, data acquisition, data preprocessing, model development, model validation, model performance, and model deployment.

RESULTS: The resulting checklist consists of 14 key questions for critically appraising ML models and development techniques, organized according to their timing along the standard ML workflow. In addition, the authors provide an overview of the ML development process and a review of key terms, models, and concepts referenced in the literature.

CONCLUSIONS: ML is poised to become an increasingly important part of neurosurgical research and clinical care. The authors hope that disseminating education on ML techniques will help neurosurgeons critically review new research and more effectively integrate this technology into their practices.