Abbreviations
AI: artificial intelligence
CI: confidence interval
ME: major error, susceptible genomes predicted to be resistant
MIC: minimum inhibitory concentration
PATRIC: Pathosystems resource integration center
PLF: PATRIC local protein family
RAST: Rapid annotation using subsystem technology
RF: Random Forest
SR: susceptible and resistant
VME: very major error, resistant genomes predicted to be susceptible
XGB: XGBoost

Abstract
A growing number of studies have shown that machine learning algorithms can be used to accurately predict antimicrobial resistance (AMR) phenotypes from bacterial sequence data. In these studies, models are typically trained using input features derived from comprehensive sets of known AMR genes or whole genome sequences. However, it can be difficult to determine whether genomes and their corresponding sets of AMR genes are complete when sequencing contaminated or metagenomic samples. In this study, we explore the possibility of using incomplete genome sequence data to predict AMR phenotypes. Machine learning models were built from randomly-selected sets of core genes that are held in common among the members of a species, and the AMR-conferring genes were removed based on their protein annotations. For Klebsiella pneumoniae, Mycobacterium tuberculosis, Salmonella enterica, and Staphylococcus aureus, we report that it is possible to classify susceptible and resistant phenotypes with average F1 scores ranging from 0.80-0.89 with as few as 100 conserved non-AMR genes, with very major error rates ranging from 0.11-0.23 and major error rates ranging from 0.10-0.20. Models built from core genes have predictive power in the cases where the primary AMR mechanism results from SNPs or horizontal gene transfer. By randomly sampling non-overlapping sets of core genes for use in these models, we show that F1 scores and error rates are stable and have little variance between replicates.
Potential biases from strain-specific SNPs, phylogenetic sampling, and imbalances in the phylogenetic distribution of susceptible and resistant strains do not appear to have an impact on this result. Although these small core gene models have lower accuracies and higher error rates than models built from the corresponding assembled genomes, the results suggest that sufficient variation exists in the core non-AMR genes of a species for predicting AMR phenotypes. Overall, this study suggests that building models from conserved genes may be a useful strategy for predicting AMR phenotypes when genomes are incomplete.
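
To make the sampling scheme and the error definitions above concrete, the following is a minimal Python sketch, not the pipeline used in this study. The function names, the "R"/"S" label encoding, and the default of 10 replicate sets are hypothetical choices made for illustration. The rate denominators (VME over all resistant genomes, ME over all susceptible genomes) and the choice of the resistant class as the positive class for F1 are likewise assumptions, since only the error types themselves are defined in the abbreviations list.

```python
import random


def sample_disjoint_gene_sets(core_genes, set_size=100, n_sets=10, seed=0):
    """Draw n_sets random, non-overlapping gene sets of set_size genes each.

    Shuffling once and slicing guarantees that no gene appears in two sets.
    """
    if set_size * n_sets > len(core_genes):
        raise ValueError("not enough core genes for the requested disjoint sets")
    rng = random.Random(seed)
    shuffled = list(core_genes)
    rng.shuffle(shuffled)
    return [shuffled[i * set_size:(i + 1) * set_size] for i in range(n_sets)]


def error_rates(y_true, y_pred):
    """Return (vme_rate, me_rate, f1) for labels encoded as "R" or "S".

    VME: resistant genomes predicted susceptible, divided by all resistant
         genomes (assumed denominator).
    ME:  susceptible genomes predicted resistant, divided by all susceptible
         genomes (assumed denominator).
    F1:  computed with "R" as the positive class (assumption).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == "R" and p == "R")
    fn = sum(1 for t, p in pairs if t == "R" and p == "S")  # very major errors
    fp = sum(1 for t, p in pairs if t == "S" and p == "R")  # major errors
    tn = sum(1 for t, p in pairs if t == "S" and p == "S")

    vme_rate = fn / (tp + fn) if (tp + fn) else 0.0
    me_rate = fp / (fp + tn) if (fp + tn) else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0
    return vme_rate, me_rate, f1
```

In use, each gene set returned by `sample_disjoint_gene_sets` would define the feature space for one replicate model, and `error_rates` would be applied to that model's held-out predictions. Comparing the resulting values across replicates is one way to check the stability described above.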