2020
DOI: 10.1371/journal.pcbi.1007895

A systematic machine learning and data type comparison yields metagenomic predictors of infant age, sex, breastfeeding, antibiotic usage, country of origin, and delivery type

Abstract: The microbiome is a new frontier for building predictors of human phenotypes. However, machine learning in the microbiome is fraught with issues of reproducibility, driven in large part by the wide range of analytic models and metagenomic data types available. We aimed to build robust metagenomic predictors of host phenotype by comparing prediction performances and biological interpretation across 8 machine learning methods and 4 different types of metagenomic data. Using 1,570 samples from 300 infants, we fit…

Cited by 24 publications (44 citation statements)
References 42 publications
“…While several statistical analysis tools have been developed specifically for microbiome data, they are generally limited to testing for differential abundance of microbial taxa between groups of samples and do not allow users to evaluate their predictivity as they do not comprise full ML workflows for biomarker discovery [14–16]. To overcome the limitations of testing-based approaches, several researchers have explicitly built ML classifiers to distinguish case and control samples [17–24]; however, the software resulting from these studies is generally not easily modified or transferred to other classification tasks or data types. To our knowledge, a powerful yet user-friendly computational ML toolkit tailored to the characteristics of microbiome data has not yet been published.…”
Section: Introduction (mentioning)
confidence: 99%
“…Le Goallec et al. (2020) proposed a framework for building microbiome-derived indicators of host phenotypes: infant age, sex, breastfeeding status, historical antibiotic usage, country of origin, and delivery type. By leveraging five different types of data and their combinations (host demographics (“baseline” data) and the four microbiome data types: BioCyc pathway relative abundance, Co-Abundance Groups (CAGs) relative abundance, MetaPhlAn2 taxa relative abundance, and gene relative abundance), they compared the prediction performances of 8 machine learning methods: 2 different elastic net (Elastic Net Caret and Elastic Net 2) implementations, 2 random forest (RF Caret and RF2) implementations, 2 gradient boosted machine (GBM Caret and GBM2) implementations, support vector machines (SVM; kernels: linear, polynomial of degree 2, and radial), K-nearest neighbors (KNN), and naive Bayes (NB).…”
Section: Methods (mentioning)
confidence: 99%
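
The comparison described in the statement above (several learning methods scored on several data types) can be pictured as a small grid of cross-validated accuracies. The following is only a minimal sketch under stated assumptions, not the cited authors' pipeline: the data are synthetic, the keys "taxa" and "pathways" are placeholder labels, the two classifiers (nearest centroid and KNN) are simple stand-ins written from scratch, and every function name here is hypothetical.

```python
import random
from statistics import mean


def make_synthetic(n, n_features, shift, seed):
    """Two-class toy data (assumption: stands in for a real abundance table)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2  # alternate labels so both classes are balanced
        X.append([rng.gauss(shift * label, 1.0) for _ in range(n_features)])
        y.append(label)
    return X, y


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(X_train, y_train, X_test):
    """Predict the class whose training mean is closest to each test sample."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c])) for x in X_test]


def knn(X_train, y_train, X_test, k=5):
    """Majority vote among the k nearest training samples."""
    preds = []
    for x in X_test:
        nearest = sorted(zip(X_train, y_train), key=lambda p: sq_dist(x, p[0]))[:k]
        votes = [t for _, t in nearest]
        preds.append(max(set(votes), key=votes.count))
    return preds


def cv_accuracy(model, X, y, n_folds=5):
    """Mean accuracy over n_folds folds (sample i goes to fold i % n_folds)."""
    scores = []
    for fold in range(n_folds):
        train = [i for i in range(len(X)) if i % n_folds != fold]
        test = [i for i in range(len(X)) if i % n_folds == fold]
        preds = model([X[i] for i in train], [y[i] for i in train],
                      [X[i] for i in test])
        scores.append(mean(int(p == y[i]) for p, i in zip(preds, test)))
    return mean(scores)


# Hypothetical stand-ins for two data types and two learning methods.
DATA_TYPES = {
    "taxa": make_synthetic(100, 5, shift=1.0, seed=0),
    "pathways": make_synthetic(100, 5, shift=0.5, seed=1),
}
METHODS = {"nearest_centroid": nearest_centroid, "knn": knn}

# One cross-validated score per (method, data type) pair.
results = {
    (m_name, d_name): cv_accuracy(model, X, y)
    for m_name, model in METHODS.items()
    for d_name, (X, y) in DATA_TYPES.items()
}

for (m_name, d_name), acc in sorted(results.items()):
    print(f"{m_name:>16} on {d_name:<8}: accuracy = {acc:.2f}")
```

With real data, each entry of `DATA_TYPES` would hold one feature table and the phenotype labels, and `METHODS` would hold the actual learners being compared.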
“…In these cases, linear methods were a better choice, because of the ease of interpretation. The authors concluded that significant pairwise relationships could be detected between phenotypes and biomarkers (Le Goallec et al., 2020).…”
Section: Methods (mentioning)
confidence: 99%