Background
The Fontan operation is associated with significant morbidity and premature mortality. Fontan cases cannot always be identified by International Classification of Diseases (ICD) codes, making it challenging to create large Fontan patient cohorts. We sought to develop natural language processing–based machine learning models to automatically detect Fontan cases from free texts in electronic health records, and compare their performances with ICD code–based classification.
Methods and Results
We included free‐text notes of 10 935 manually validated patients, 778 (7.1%) Fontan and 10 157 (92.9%) non‐Fontan, from 2 health care systems. Using 80% of the patient data, we trained and optimized multiple machine learning models, support vector machines and 2 versions of RoBERTa (a robustly optimized transformer‐based model for language understanding), for automatically identifying Fontan cases based on notes. For RoBERTa, we implemented a novel sliding window strategy to overcome its length limit. We evaluated the machine learning models and ICD code–based classification on 20% of the held‐out patient data using the F1 score metric. The ICD classification model, support vector machine, and RoBERTa achieved F1 scores of 0.81 (95% CI, 0.79–0.83), 0.95 (95% CI, 0.92–0.97), and 0.89 (95% CI, 0.88–0.85) for the positive (Fontan) class, respectively. Support vector machines obtained the best performance (P<0.05), and both natural language processing models outperformed ICD code–based classification (P<0.05). The sliding window strategy improved performance over the base model (P<0.05) but did not outperform support vector machines. ICD code–based classification produced more false positives.
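As an illustration only, the sliding window idea and the positive-class F1 metric can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation used in the study: the abstract does not report the window size, the stride, or how window-level outputs were combined, so the values 512 and 256, the max-over-windows aggregation, the 0.5 threshold, and the names `sliding_windows`, `classify_patient`, `f1_positive`, and `score_window` are all hypothetical. `score_window` stands in for a fine-tuned classifier that returns a Fontan probability for one window of tokens.

```python
from typing import Callable, List, Sequence


def sliding_windows(tokens: Sequence[str], window: int = 512,
                    stride: int = 256) -> List[List[str]]:
    """Split a token sequence into overlapping windows of at most `window` tokens.

    The window and stride defaults are assumptions, not values from the study.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if len(tokens) <= window:
        return [list(tokens)]
    windows = []
    start = 0
    while True:
        windows.append(list(tokens[start:start + window]))
        if start + window >= len(tokens):
            break  # the last window reached the end of the note
        start += stride
    return windows


def classify_patient(tokens: Sequence[str],
                     score_window: Callable[[List[str]], float],
                     window: int = 512, stride: int = 256,
                     threshold: float = 0.5) -> bool:
    """Label a patient as Fontan if any window scores at or above the threshold.

    Max-over-windows aggregation is an assumption made for this sketch.
    """
    scores = [score_window(w) for w in sliding_windows(tokens, window, stride)]
    return max(scores) >= threshold


def f1_positive(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """F1 score for the positive class (1 = Fontan, 0 = non-Fontan)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


if __name__ == "__main__":
    # Toy scorer standing in for a trained model: flags any window
    # that contains the token "fontan".
    def toy_scorer(w: List[str]) -> float:
        return 1.0 if "fontan" in w else 0.0

    note = ["word"] * 1000 + ["fontan"] + ["word"] * 200
    print(len(sliding_windows(note)))
    print(classify_patient(note, toy_scorer))
    print(f1_positive([1, 1, 0, 0], [1, 0, 1, 0]))
```

Overlapping windows reduce the chance that a relevant mention is split across a window boundary; the cost is that each note is scored several times.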
Conclusions
Natural language processing models can automatically detect Fontan patients based on clinical notes with higher accuracy than ICD codes, and the former demonstrated the possibility of further improvement.