Background
Among patients colonized with carbapenem-resistant Klebsiella pneumoniae (CRKP), only a subset develop clinical infection. While patient characteristics may influence the risk of infection, it remains unclear whether the genetic background of CRKP strains contributes to this risk. We applied machine learning to quantify the capacity of patient characteristics and microbial genotypes to discriminate infection from colonization, and identified patient and microbial features associated with infection across multiple healthcare facilities.
Methods
Machine learning models were built using whole-genome sequences and clinical metadata from 331 patients colonized or infected with CRKP across 21 long-term acute care hospitals. To quantify variation in performance, we built models using 100 different train/test splits of the entire dataset and of urinary and respiratory site-specific subsets, and evaluated predictive performance on each test split using the area under the receiver operating characteristic curve (AUROC). Patient and microbial features predictive of infection were identified as those consistently important for predicting infection, based on the average change in AUROC when they were included in the model.
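The evaluation scheme described above can be sketched roughly as follows. This is a minimal illustration in plain Python, not the study's pipeline: the data are synthetic, the linear scorer (`fit_scorer`) is a toy stand-in for the actual machine learning models, and all function names, the 80/20 stratified split, and the leave-one-feature-out definition of importance are assumptions made for the example.

```python
import random
import statistics


def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(labels, scores) if t == 1]
    neg = [s for t, s in zip(labels, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_scorer(X, y):
    """Toy linear scorer (assumed stand-in model): weight = difference of class means."""
    cols = range(len(X[0]))
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == 0]
    w = [sum(x[j] for x in pos) / len(pos) - sum(x[j] for x in neg) / len(neg) for j in cols]
    return lambda x: sum(wj * xj for wj, xj in zip(w, x))


def stratified_splits(y, n_splits=100, test_frac=0.2, seed=0):
    """Yield (train, test) index lists with both classes present in every test split."""
    rng = random.Random(seed)
    by_class = {c: [i for i, t in enumerate(y) if t == c] for c in (0, 1)}
    for _ in range(n_splits):
        train, test = [], []
        for members in by_class.values():
            shuffled = rng.sample(members, len(members))
            k = max(1, int(len(shuffled) * test_frac))
            test += shuffled[:k]
            train += shuffled[k:]
        yield train, test


def split_aurocs(X, y, features, splits):
    """Test-set AUROC for each split, using only the listed feature columns."""
    sub = [[x[j] for j in features] for x in X]
    out = []
    for train, test in splits:
        score = fit_scorer([sub[i] for i in train], [y[i] for i in train])
        out.append(auroc([y[i] for i in test], [score(sub[i]) for i in test]))
    return out


def inclusion_importance(X, y, splits):
    """Mean change in test AUROC when each feature is included vs. left out."""
    all_feats = list(range(len(X[0])))
    full = split_aurocs(X, y, all_feats, splits)
    result = {}
    for j in all_feats:
        reduced = split_aurocs(X, y, [f for f in all_feats if f != j], splits)
        result[j] = statistics.mean(a - b for a, b in zip(full, reduced))
    return result


# Synthetic data (hypothetical): feature 0 carries signal, features 1 and 2 are noise.
rng = random.Random(1)
y = [i % 2 for i in range(200)]
X = [[t + rng.gauss(0, 1), rng.gauss(0, 1), rng.gauss(0, 1)] for t in y]

splits = list(stratified_splits(y))           # the same 100 splits are reused throughout
aurocs = split_aurocs(X, y, [0, 1, 2], splits)
q1, _, q3 = statistics.quantiles(aurocs, n=4)  # interquartile range across splits
importance = inclusion_importance(X, y, splits)

print(f"AUROC IQR: {q1:.2f}-{q3:.2f}")
print({j: round(v, 3) for j, v in importance.items()})
```

Reporting the interquartile range over all splits, rather than the AUROC of a single split, shows how much apparent performance depends on which patients happen to land in the test set. Reusing the same splits for the full and reduced models keeps the per-feature comparison paired.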
Findings
We found that patient and genomic features were only weakly predictive of clinical CRKP infection vs. colonization (AUROC IQRs: patient=0.59-0.68, genomic=0.55-0.61, combined=0.62-0.68), and that neither feature set consistently outperformed the other (genomic vs. patient p=0.4). Model performance was comparable for anatomic site-specific models (combined AUROC IQRs: respiratory=0.61-0.71, urinary=0.54-0.64). Strong genomic predictors of infection included the presence of the ICEKp10 mobile genetic element carrying an iron acquisition system (yersiniabactin) and a toxin (colibactin), along with disruption of an O-antigen biosynthetic gene in a sub-lineage of the epidemic ST258 clone. Teasing apart sequential evolutionary steps in the context of clinical metadata indicated that altered O-antigen biosynthesis increased association with the respiratory tract, and that subsequent acquisition of ICEKp10 was associated with increased virulence.
Interpretation
Our results support the need for rigorous machine learning frameworks to obtain realistic estimates of the performance of clinical models of infection. Moreover, integrating microbial genomic and clinical data within such a framework can help disentangle the contribution of microbial genetic variation to clinical outcomes.
Funding
Centers for Disease Control and Prevention, National Institutes of Health, National Science Foundation