High performance Legionella pneumophila source attribution using genomics-based machine learning classification

Buultjens, Andrew H.; Vandelannoote, Koen; Mercoulia, Karolina; Ballard, Susan A; Sloggett, Clare; Howden, Benjamin P; Seemann, Torsten; Stinear, Timothy P.

doi:10.1101/2023.03.19.532693

2023


Preprint


Abstract: Fundamental to effective Legionnaires' disease outbreak control is the ability to rapidly identify the environmental source(s) of the causative agent, Legionella pneumophila. Genomics has revolutionized pathogen surveillance, but L. pneumophila has a complex ecology and population structure that can limit source inference based on standard core genome phylogenetics. Here we present a powerful machine learning approach that assigns the geographical source of Legionnaires' disease outbreaks more accurately than curr…


Cited by 1 publication

References 45 publications


Harmonization of supervised machine learning practices for efficient source attribution of Listeria monocytogenes based on genomic data

Castelli, De Ruvo, Bucciacchio et al. 2023

BMC Genomics

Background: Genomic data-based machine learning tools are promising for real-time surveillance activities that perform source attribution of foodborne bacteria such as Listeria monocytogenes. Given the heterogeneity of machine learning practices, our aim was to identify the practices that influence the source prediction performance of the usual holdout method combined with the repeated k-fold cross-validation method.

Methods: A large collection of 1,100 L. monocytogenes genomes with known sources was built according to several genomic metrics to ensure authenticity and completeness of genomic profiles. Based on these genomic profiles (i.e. 7-locus alleles, core alleles, accessory genes, core SNPs and pan kmers), we developed a versatile workflow assessing the prediction performance of different combinations of training dataset splitting (i.e. 50, 60, 70, 80 and 90%), data preprocessing (i.e. with or without near-zero variance removal), and learning models (i.e. BLR, ERT, RF, SGB, SVM and XGB). The performance metrics included accuracy, Cohen's kappa, F1-score, area under the receiver operating characteristic curve, precision-recall curve or precision-recall-gain curve, and execution time.

Results: The average testing accuracies from accessory genes and pan kmers were significantly higher than the accuracies from core alleles or SNPs. While the accuracies from 70 and 80% training dataset splits were not significantly different from each other, those from 80% were significantly higher than those from the other tested proportions. Near-zero variance removal did not allow results to be produced for 7-locus alleles, did not significantly affect accuracy for core alleles, accessory genes and pan kmers, and significantly decreased accuracy for core SNPs. The SVM and XGB models did not differ significantly from each other in accuracy and reached significantly higher accuracies than BLR, SGB, ERT and RF (listed in that order of accuracy). However, the SVM model required more computing power than the XGB model, especially for large numbers of descriptors such as core SNPs and pan kmers.

Conclusions: In addition to recommendations about machine learning practices for L. monocytogenes source attribution based on genomic data, the present study also provides a freely available workflow that can be applied, without source code modifications, to other balanced or unbalanced multiclass phenotypes from binary and categorical genomic profiles of other microorganisms.
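Two of the steps named in this abstract, near-zero variance removal and scoring with accuracy and Cohen's kappa, can be illustrated with a minimal Python sketch. This is not the authors' workflow: the function names, the 0.95 dominance threshold, the simplified filtering rule (drop a feature when one value covers almost all samples) and the toy "food"/"animal" source labels are all assumptions made for illustration.

```python
from collections import Counter


def drop_near_zero_variance(rows, max_dominant_fraction=0.95):
    """Drop columns whose most common value covers nearly all samples.

    Simplified, hypothetical criterion (not the cited workflow's rule):
    a column is removed when its dominant value's share of the samples
    exceeds max_dominant_fraction. Returns (filtered_rows, kept_indices).
    """
    n_samples = len(rows)
    n_columns = len(rows[0])
    kept = []
    for col in range(n_columns):
        counts = Counter(row[col] for row in rows)
        dominant = counts.most_common(1)[0][1]
        if dominant / n_samples <= max_dominant_fraction:
            kept.append(col)
    filtered = [[row[col] for col in kept] for row in rows]
    return filtered, kept


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted source matches the known source."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    # Chance agreement: sum over labels of P(true = label) * P(pred = label).
    expected = sum(
        true_counts[label] * pred_counts[label] for label in true_counts
    ) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when only one label occurs")
    return (observed - expected) / (1.0 - expected)


if __name__ == "__main__":
    # Toy binary presence/absence profiles (hypothetical data):
    # the first column is constant, so it carries no information.
    profiles = [[1, 0, 1], [1, 1, 0], [1, 0, 1], [1, 1, 0]]
    filtered, kept = drop_near_zero_variance(profiles)
    print("kept columns:", kept)

    # Hypothetical source labels and predictions.
    known = ["food", "food", "animal", "animal"]
    predicted = ["food", "animal", "animal", "animal"]
    print("accuracy:", accuracy(known, predicted))
    print("kappa:", cohen_kappa(known, predicted))
```

Kappa is reported alongside accuracy because, with unbalanced source classes, a classifier that always predicts the majority source can reach high accuracy while its kappa stays near zero.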

