2015
DOI: 10.1093/bioinformatics/btv683

Large-scale machine learning for metagenomics sequence classification

Abstract: Motivation: Metagenomics characterizes the taxonomic diversity of microbial communities by sequencing DNA directly from an environmental sample. One of the main challenges in metagenomics data analysis is the binning step, where each sequenced read is assigned to a taxonomic clade. Because of the large volume of metagenomics datasets, binning methods need fast and accurate algorithms that can operate with reasonable computing requirements. While standard alignment-based methods provide state-of-the-art perform…

citations
Cited by 82 publications
(105 citation statements)
references
References 31 publications
“…Furthermore, in terms of computation, once a model has been trained using reference data, it's execution on novel sequences is generally linear and thus much less time-consuming than alignment based methods. A recent study by Vervier and co-authors [75] that explores the possibility of a large-scale machine learning implementation for taxonomic assignment problem, has confirmed that compositional approaches achieve faster prediction times and consequently are appropriate for whole metagenome studies.…”
Section: Diversity Profiling and Taxonomic Assignment
Citation type: mentioning
confidence: 99%
“…LMAT 133 and Kraken 134 both assign taxonomy based on identified LCA taxa for k-mers in each query sequence. Other methods train models on the k-mer profiles associated with each taxon, using a variety of machine learning approaches including neural networks (TAC-ELM 135 ), naïve Bayes classifiers (RITA 136 ), or linear models-based methods 137 . TAC-ELM also incorporates data on GC content and RITA combines BLAST-based reference alignments.…”
Section: High-resolution Characterization Of the Microbiome's Functio…
Citation type: mentioning
confidence: 99%
“…TAC-ELM also incorporates data on GC content and RITA combines BLAST-based reference alignments. Comparisons between Kraken and the linear model-based method above suggest that while exact k-mer matching methods like LMAT and Kraken are more accurate when query sequences originate from reference genomes, they may produce overly specific classifications for sequences from genomes absent from the reference database 137 . Moreover, Kraken requires fairly long (31 amino acid) k-mer matches, which may potentially reject many short reads due to insufficient data.…”
Section: High-resolution Characterization Of the Microbiome's Functio…
Citation type: mentioning
confidence: 99%
“…Recently, machine learning based methods [169,199] are being further used for gene prediction in metagenomic fragments. Also, k-mer-based sequence binning methods and sequence property-based methods are often seen as the input for training models.…”
Section: Machine Learning
Citation type: mentioning
confidence: 99%