The reliable detection of novel bacterial pathogens from next-generation sequencing data is a key challenge for microbial diagnostics. Current computational tools usually rely on sequence similarity and often fail to detect novel species when closely related genomes are unavailable or missing from the reference database. Here we present the machine learning-based approach PaPrBaG (Pathogenicity Prediction for Bacterial Genomes). PaPrBaG overcomes genetic divergence by training on a wide range of species with known pathogenicity phenotype. To that end, we compiled a comprehensive list of pathogenic and non-pathogenic bacteria with a human host, using various genome metadata in conjunction with a rule-based protocol. A detailed comparative study reveals that PaPrBaG has several advantages over sequence similarity approaches. Most importantly, it always provides a prediction, whereas other approaches discard a large number of sequencing reads with low similarity to currently known reference genomes. Furthermore, PaPrBaG remains reliable even at very low genomic coverages. Combining PaPrBaG with existing approaches further improves prediction results.
The vast amount and diversity of bacteria on Earth, together with ever-increasing human exposure [1], suggests that we will continue to be confronted with novel bacterial pathogens. Encouragingly, next-generation sequencing (NGS) has emerged as a novel, powerful diagnostic tool in this regard. However, the direct NGS-based characterisation of novel pathogenic strains or even species is still problematic when closely related genomes are unavailable or missing from the respective reference database. Here we introduce a machine learning-based approach, PaPrBaG, which overcomes genetic divergence in predicting bacterial pathogenicity by training on a wide range of species with known pathogenicity phenotype.
Importantly, although we use the term pathogenicity for practical reasons at several points throughout this (and related) work, it is more cautious to speak of pathogenic potential, given that pathogenicity is ultimately governed by the complex interplay between host (state) and pathogen.
Existing Methods
Existing approaches amenable to pathogenicity prediction broadly fall into two classes: protein content based and whole-genome based. Where assembled genomes are available, the presence/absence pattern of certain protein families can be expected to correlate with complex phenotypes, e.g. pathogenicity. This is primarily based on the presence of virulence factors (VFs), often acquired through horizontal gene transfer [2], or the absence of more common genes (functions) that become dispensable when e.g. host-specific pathogens evolve from commensal ancestors [3]. Three recent studies rely on these considerations.
The BacFier method by Iraola et al. [4] was the first to apply the described approach on a large scale. The authors defined eight VF categories and obtained 814 related VF protein families from KEGG [5]. They further used a set of 848 human-pathogenic (HP) and generally n...