Introduction: Administrative data permit analysis of large cohorts but rely on ICD-9-CM and ICD-10-CM billing codes that may not accurately identify congenital heart defect (CHD) cases. Variables that may improve accuracy are unknown, yet improved accuracy would improve CHD surveillance. Methods: We validated 1500 cases with an encounter between 1/1/2010 and 12/31/2019, identified by at least one of 90 CHD codes (41 ICD-9-CM, 49 ICD-10-CM) in 2 healthcare systems (1 adult, 1 pediatric), through medical record review and chart abstraction for the presence of a CHD. Inter- and intra-observer reliability exceeded 93%. Results: The positive predictive value (PPV) of ICD codes for CHD (Figure 1) was 68.0% (1020/1500) overall, 95.7% (247/258) for severe codes, 52.8% (371/703) for shunt codes, 75.2% (243/323) for valve codes, 73.2% (120/164) for shunt and valve codes, and 75.0% (39/52) for a select group of 7 codes in the ‘other’ category. PPV for cases with >1 unique CHD code was 73.1% (920/1259) vs. 41.5% (100/241) for cases with only 1 unique CHD code. Characteristics of cases with and without CHD are in Table 1. ICD code 745.5/Q21.1 in isolation was present in 2.2% of cases with confirmed CHD vs. 19.4% of cases without CHD (p<0.0001). The median number of encounters with a CHD code was higher in cases with CHD (6) than in cases without CHD (2), p<0.0001. Patent foramen ovale was present in 65.8% of false positives (316/480). Conclusion: There is significant variability in the PPV of individual CHD codes and of CHD code groups for detection of CHD. The presence of a code for severe CHD is associated with high PPV for true CHD. Use of administrative data for CHD surveillance may require the development of algorithms to improve the accuracy of case detection.
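The PPV figures in the abstract above are simple ratios of chart-confirmed CHD cases to code-flagged cases, and can be recomputed from the reported counts. The sketch below does that in plain Python. The function names and the dictionary layout are illustrative assumptions. The Wilson score interval is a standard formula added here for context only; the abstract does not report confidence intervals.

```python
from math import sqrt

# (confirmed CHD, cases flagged by code group), as reported in the abstract.
REPORTED_COUNTS = {
    "overall": (1020, 1500),
    "severe": (247, 258),
    "shunt": (371, 703),
    "valve": (243, 323),
    "shunt and valve": (120, 164),
    "other": (39, 52),
}


def ppv(confirmed: int, flagged: int) -> float:
    """Positive predictive value: confirmed cases / all flagged cases."""
    if flagged <= 0:
        raise ValueError("flagged must be a positive count")
    if not 0 <= confirmed <= flagged:
        raise ValueError("confirmed must lie between 0 and flagged")
    return confirmed / flagged


def wilson_interval(confirmed: int, flagged: int, z: float = 1.96):
    """Wilson score interval for a proportion (assumption: not in the abstract)."""
    p = ppv(confirmed, flagged)
    denom = 1 + z * z / flagged
    center = (p + z * z / (2 * flagged)) / denom
    half = z * sqrt(p * (1 - p) / flagged + z * z / (4 * flagged**2)) / denom
    return center - half, center + half


def ppv_table(counts):
    """Map each code group to its PPV in percent, rounded to one decimal."""
    return {group: round(100 * ppv(c, f), 1) for group, (c, f) in counts.items()}


if __name__ == "__main__":
    for group, (confirmed, flagged) in REPORTED_COUNTS.items():
        low, high = wilson_interval(confirmed, flagged)
        print(f"{group:16s} PPV {100 * ppv(confirmed, flagged):5.1f}% "
              f"(Wilson 95% interval {100 * low:.1f}-{100 * high:.1f}%)")
```

The five code groups partition the cohort (their flagged counts sum to 1500 and their confirmed counts to 1020), so the overall PPV is the flagged-count-weighted average of the group PPVs.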
Background: Administrative data permits analysis of large cohorts but relies on International Classification of Diseases, Ninth and Tenth Revision, Clinical Modification (ICD) codes that may not reflect true congenital heart defects (CHD). Methods: 1497 cases with at least one encounter between 1/1/2010 and 12/31/2019 in two healthcare systems (one adult, one pediatric), identified by at least one of 87 ICD CHD codes, were validated through chart review for the presence of CHD and CHD anatomic group. Results: Inter- and intra-observer reliability averaged >95%. Positive predictive value (PPV) of ICD codes for CHD was 68.1% (1020/1497) overall, 94.6% (123/130) for cases identified in both healthcare systems, 95.8% (249/260) for severe codes, 52.6% (370/703) for shunt codes, 75.9% (243/320) for valve codes, 73.5% (119/162) for shunt and valve codes, and 75.0% (39/52) for "Other CHD" (7 ICD codes). PPV for cases with >1 unique CHD code was 85.4% (503/589) vs. 56.3% (498/884) for cases with one CHD code. Of cases with secundum atrial septal defect ICD codes 745.5/Q21.1 in isolation, 30.9% (123/398) had a confirmed CHD. Patent foramen ovale was present in 66.2% (316/477) of false positives (FP). The median number of unique CHD-coded encounters was higher for true positives (TP) than for FP (2.0; interquartile range [IQR]: 1.0-3.0 vs. 1.0; IQR: 1.0-1.0, respectively, p<0.0001). TP had a younger mean age at first encounter with a CHD code than FP (22.4 years vs. 26.3 years, p=0.0017). Conclusion: The PPV of ICD codes for detection of CHD varies by case characteristics, ICD code, and anatomic grouping. While an ICD code for severe CHD and/or the presence of a case in more than one data source, regardless of anatomic group, is associated with higher PPV for CHD, most TP cases did not have these characteristics. The development of algorithms to improve accuracy may improve administrative data for CHD surveillance.
Introduction: The Fontan operation palliates single ventricle heart defects. As native anatomy varies, Fontan cases cannot always be identified by ICD-9-CM or ICD-10-CM codes. Hypothesis: We sought to train and evaluate a supervised machine learning (ML) system to identify Fontan cases based on unstructured clinical notes in a large database. Methods: 160 adult Fontan patients from validated clinical data at a single tertiary referral center with available text notes were studied. The data set was imbalanced, with more non-Fontan cases than Fontan patients; thus we created multiple datasets with positive:negative case ratios ranging from 1:2 to 1:10. We used stratified 80-20 training-testing splits of data. Vectorized representations of text notes were used as features. We trained a Support Vector Machine (SVM) model, which is widely used for text classification, to identify Fontan cases from notes. For each dataset, we performed random 80-20 data splitting 10 times and reported the average F1 score (harmonic mean of recall/sensitivity and precision/positive predictive value) over the positive class. Results: The model achieved a mean F1 score of 0.95 for the positive class on the data split with a 1:2 positive-negative ratio. Increasing data imbalance from 1:2 to 1:10 did not substantially impact performance. The mean F1 score over all data splits was 0.94 (SD 0.01). We also computed precision, recall, and F1 score of ICD codes to identify Fontan patients. Performance comparisons between ICD codes only and Natural Language Processing (NLP)/ML are in Table 1. Conclusions: A supervised classification model detects Fontan patients from clinical notes more accurately than ICD codes. The model is robust and insensitive to data imbalance. Findings suggest our model may work effectively in real-world data. Since the sensitivity of ICD codes is high but PPV is low, it may be beneficial to apply ICD codes as a filter prior to applying NLP/ML to improve performance.
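The suggestion above, using high-sensitivity ICD codes as a first-pass filter and applying the note classifier only to flagged patients, can be sketched as a two-stage screen scored with the F1 metric the abstract defines. Everything below is a hypothetical illustration: the `Patient` record, the keyword rule standing in for the trained SVM, and the five toy patients are invented for this sketch and are not data or code from the study.

```python
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Patient(NamedTuple):
    patient_id: str
    has_icd_code: bool  # hypothetical flag: any Fontan-related ICD code present
    note_text: str      # free-text clinical note
    is_fontan: bool     # chart-validated reference label


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision (PPV), recall (sensitivity), and their harmonic mean (F1)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def evaluate(patients: Sequence[Patient], predictions: Sequence[bool]):
    """Score predictions for the positive (Fontan) class."""
    tp = sum(pred and p.is_fontan for p, pred in zip(patients, predictions))
    fp = sum(pred and not p.is_fontan for p, pred in zip(patients, predictions))
    fn = sum(not pred and p.is_fontan for p, pred in zip(patients, predictions))
    return precision_recall_f1(tp, fp, fn)


def two_stage_predict(patients: Sequence[Patient],
                      note_classifier: Callable[[str], bool]) -> List[bool]:
    """Stage 1: keep ICD-flagged patients. Stage 2: classify their notes."""
    return [p.has_icd_code and note_classifier(p.note_text) for p in patients]


def keyword_classifier(note: str) -> bool:
    """Toy stand-in for a trained text classifier (assumption, not the SVM)."""
    return "fontan" in note.lower()


# Invented toy cohort, for illustration only.
TOY_PATIENTS = [
    Patient("A", True, "s/p Fontan palliation", True),
    Patient("B", True, "tricuspid atresia, awaiting surgery", False),
    Patient("C", False, "family history: cousin had Fontan", False),
    Patient("D", True, "extracardiac Fontan 2004", True),
    Patient("E", False, "routine visit", False),
]

if __name__ == "__main__":
    icd_only = [p.has_icd_code for p in TOY_PATIENTS]
    combined = two_stage_predict(TOY_PATIENTS, keyword_classifier)
    print("ICD only  (P, R, F1):", evaluate(TOY_PATIENTS, icd_only))
    print("Two-stage (P, R, F1):", evaluate(TOY_PATIENTS, combined))
```

In this toy cohort the ICD filter removes a note-level false positive (patient C) and the note classifier removes an ICD-level false positive (patient B). A filter can only help while ICD sensitivity stays high, because any Fontan patient without a qualifying code is lost before the classifier sees the note.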
Background: The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free text in electronic health records, and to compare their performance with ICD code–based classification. Methods and Results: We included free‐text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non‐Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer‐based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on the held‐out 20% of the patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives. Conclusions: Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.
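The abstract does not describe how its sliding window strategy works, so the following is only a generic sketch of the idea: split a long token sequence into overlapping windows that each fit the model's input limit, score every window, and combine the scores. The window size of 512 reflects the input limit commonly cited for RoBERTa. The stride, the max-pooling aggregation, the threshold, and the toy scorer are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window_size: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split tokens into overlapping windows that together cover every token."""
    if window_size <= 0 or not 0 < stride <= window_size:
        raise ValueError("need window_size > 0 and 0 < stride <= window_size")
    if len(tokens) <= window_size:
        return [list(tokens)]
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window_size]))
        if start + window_size >= len(tokens):
            break  # the last window reached the end of the note
        start += stride
    return windows


def classify_long_note(tokens: Sequence[str],
                       score_window: Callable[[Sequence[str]], float],
                       window_size: int = 512, stride: int = 256,
                       threshold: float = 0.5) -> bool:
    """Flag the note if any window scores at or above the threshold.

    Max-pooling over windows is an assumed aggregation rule; averaging or
    voting are equally plausible alternatives.
    """
    scores = [score_window(w) for w in sliding_windows(tokens, window_size, stride)]
    return max(scores) >= threshold


def toy_scorer(window: Sequence[str]) -> float:
    """Hypothetical stand-in for a transformer's positive-class probability."""
    return 1.0 if "fontan" in window else 0.0


if __name__ == "__main__":
    # A 1000-token note whose only informative token sits past position 512.
    note = ["word"] * 1000
    note[700] = "fontan"
    truncated = toy_scorer(note[:512]) >= 0.5
    windowed = classify_long_note(note, toy_scorer)
    print("truncated to first 512 tokens:", truncated)
    print("sliding window:", windowed)
```

The example note shows the motivation: truncating to the first window discards evidence that appears late in a long note, while overlapping windows keep every token visible to the model at the cost of several forward passes per note.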