2008
DOI: 10.1155/2008/205969

Metagenome Fragment Classification Using N‐Mer Frequency Profiles

Abstract: A vast amount of microbial sequencing data is being generated through large-scale projects in ecology, agriculture, and human health. Efficient high-throughput methods are needed to analyze the mass amounts of metagenomic data, all DNA present in an environmental sample. A major obstacle in metagenomics is the inability to obtain accuracy using technology that yields short reads. We construct the unique N-mer frequency profiles of 635 microbial genomes publicly available as of February 2008. These profiles are…
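
As an illustration of what an N-mer frequency profile is, the following minimal sketch counts every overlapping N-mer in a sequence and normalizes the counts to relative frequencies. The function name `nmer_profile`, the toy sequence, and the decision to skip windows containing ambiguous bases are assumptions made for this example; the abstract does not specify how the paper builds its profiles.

```python
from collections import Counter


def nmer_profile(sequence, n):
    """Return the relative frequency of each overlapping N-mer in `sequence`.

    Hypothetical helper for illustration; windows containing anything
    other than A, C, G, or T are skipped (an assumption of this sketch).
    """
    seq = sequence.upper()
    valid = set("ACGT")
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= valid
    )
    total = sum(counts.values())
    # An empty profile is returned when the sequence is shorter than n.
    return {mer: c / total for mer, c in counts.items()} if total else {}


# Toy input: 8 overlapping 3-mers, each of the 4 distinct ones seen twice.
profile = nmer_profile("ACGTACGTAC", 3)
print(profile)
```

A real genome would be read from a FASTA file and would usually use a larger N; the toy string here only keeps the arithmetic easy to follow.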

Cited by 95 publications (97 citation statements)
References 35 publications
“…Most of the current metagenomics classification programs either suffer from slow classification speed, a large index size, or both. For example, machine-learning-based approaches such as the Naive Bayes Classifier (NBC) (Rosen et al 2008) and PhymmBL (Brady and Salzberg 2009, 2011) classify <100 reads per minute, which is too slow for data sets that contain millions of reads. In contrast, the pseudoalignment approach employed in Kraken (Wood and Salzberg 2014) processes reads far more quickly, more than 1 million reads per minute, but its exact k-mer matching algorithm requires a large index.…”
mentioning
confidence: 99%
“…This classifier was first implemented for organism classification by Sandberg in 2001 on a small set of just 28 genomes, and has since been further extended to a larger database of 635 genomes by Rosen [9,16]. The outputted scores for each fragment are then submitted as features to an unsupervised clustering algorithm.…”
Section: Methods
mentioning
confidence: 99%
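
The statement above describes a naive Bayes classifier that outputs a score per fragment, with those scores then used as features. A minimal sketch of that idea, assuming one N-mer model per genome and add-one smoothing (the names `train`, `score`, `classify`, the `pseudocount` parameter, and the toy genomes are all hypothetical and not taken from the cited papers):

```python
import math
from collections import Counter
from itertools import product


def nmers(seq, n):
    """List the overlapping N-mers of a sequence."""
    seq = seq.upper()
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]


def train(genome, n, pseudocount=1.0):
    """Log-probability of every possible N-mer in one genome.

    Smoothing with a pseudocount is an assumption of this sketch, so that
    N-mers unseen in a genome do not get probability zero.
    """
    counts = Counter(nmers(genome, n))
    vocab = ["".join(p) for p in product("ACGT", repeat=n)]
    total = sum(counts[m] for m in vocab) + pseudocount * len(vocab)
    return {m: math.log((counts[m] + pseudocount) / total) for m in vocab}


def score(fragment, model, n):
    """Naive Bayes log-likelihood: sum of log-probabilities of the N-mers.

    N-mers outside the A/C/G/T vocabulary are skipped.
    """
    return sum(model[m] for m in nmers(fragment, n) if m in model)


def classify(fragment, models, n):
    """Return the best-scoring genome name and the full score vector."""
    scores = {name: score(fragment, m, n) for name, m in models.items()}
    return max(scores, key=scores.get), scores


# Toy reference "genomes"; real ones would be full microbial sequences.
n = 2
models = {
    "AT_rich": train("ATATATTATAATATAT", n),
    "GC_rich": train("GCGCGGCCGCGGCGCC", n),
}
label, scores = classify("ATATTA", models, n)
print(label, scores)
```

The `scores` dictionary is the kind of per-fragment score vector that could be handed to a downstream clustering step; a uniform prior over genomes is assumed here, so the prior term is omitted.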
“…Most supervised classification methods for metagenomics employ either a homology-based alignment or a composition-based frequency model [8,9,10,11]. However, as mentioned above, either of these techniques can only identify the 1-2% of organisms (those that are known), and perhaps classify another 50-70% to a higher taxonomic level (such as order or phylum).…”
Section: Neural Network-based Taxonomic Clustering For Metagenomics
mentioning
confidence: 99%
“…Most of the existing clustering methods are supervised and depend on the availability of reference data for training [15,3,19,5]. A metagenome may however, contain reads from unexplored phyla which cannot be labeled into one of the existing classes.…”
Section: Related Work
mentioning
confidence: 99%
“…The dominant patterns in the data are captured by its component distributions. Most mixture models assume an underlying normal distribution [19]. However, the distribution of word counts within a genome vary according to a Poisson distribution [17,18].…”
Section: Related Work
mentioning
confidence: 99%