We present a new algorithm to cluster high-dimensional sequence data, and its application to the field of metagenomics, which aims to reconstruct individual genomes from a mixture of genomes sampled from an environmental site, without any prior knowledge of reference data (genomes) or of the shape of clusters. Such problems typically cannot be solved directly with classical approaches that estimate the density of clusters, e.g., using the shared nearest neighbors rule, because of the prohibitive size of contemporary sequence datasets. We explore here a new method that combines the shared nearest neighbor (SNN) rule with the concept of Locality Sensitive Hashing (LSH). The proposed method, called LSH-SNN, works by randomly splitting the input data into smaller subsets (buckets) and applying the shared nearest neighbor rule within each bucket. Links are created between neighbors that share a sufficient number of elements, allowing clusters to be grown from linked elements. LSH-SNN scales up to datasets consisting of millions of sequences, while achieving high accuracy across a variety of sample sizes and complexities.

This bias is known as the uniform effect of K-means. Moreover, the number of clusters K has to be specified a priori, which is not trivial when no prior knowledge is available. To address these problems, methods based on estimating the density and/or the similarity among instances have been introduced [15,30]. In [14], the authors presented an effective clustering method based on two key notions: the similarity between neighboring elements and the density around instances. This method, Shared Nearest Neighbors (SNN), is a density-based clustering algorithm that incorporates a suitable similarity measure to cluster data.
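To make the shared-neighbor notion concrete, the following minimal Python sketch computes the SNN similarity between two points as the number of nearest neighbors their k-NN lists have in common. This is only an illustration, not the authors' implementation: the brute-force neighbor search, the function names, and the toy data are ours.

```python
import numpy as np

def knn_lists(X, k):
    """Return, for every point, the index set of its k nearest neighbors
    (brute-force Euclidean distances; fine for a small illustration)."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbor
    return [set(np.argsort(row)[:k]) for row in d]

def snn_similarity(neighbors, p, q):
    """SNN similarity = number of neighbors shared by p and q; in SNN,
    points are linked when this count exceeds a threshold, and dense
    (core) points seed the clusters."""
    return len(neighbors[p] & neighbors[q])

# Two tight groups of points in the plane (toy data).
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
              [5.0, 5.0], [5.1, 5.0]])
nbrs = knn_lists(X, k=2)
print(snn_similarity(nbrs, 0, 1))  # -> 1 (points 0 and 1 share neighbor 2)
```

Growing clusters from links whose shared-neighbor count passes a threshold is what makes the method robust to clusters of varying density.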
After finding the nearest neighbors of each element and computing the similarity between pairs of points, SNN identifies core points, eliminates noisy elements and builds clusters around the core elements. This method can outperform other clustering approaches on data of varying densities, and it can automatically determine the number of output clusters. However, it has complexity O(n^2), where n is the number of instances in the dataset, arising from the computation of the similarity matrix; this can be prohibitive when dealing with high-dimensional data. One interesting concept for reducing the burden of computing the similarity matrix is Locality Sensitive Hashing (LSH). This concept was initially introduced to find approximate near neighbors in high-dimensional spaces [19,51]. The key idea is to hash elements into buckets; then, for a query instance x, the instances stored in the buckets containing x serve as candidates for near neighbors. This approximation reduces the query time complexity from O(n) (the cost of a linear scan over all instances for a single query) to O(log n). Therefore, the similarity m...
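The bucketing idea behind LSH can be sketched as follows. This is a generic signed-random-projection hash family (nearby points tend to receive the same bit signature), chosen here purely for illustration; it is not necessarily the hash family used by LSH-SNN, and all names, parameters and the toy data are ours.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)

def lsh_buckets(X, n_planes=8):
    """Hash each row of X to a bucket keyed by its bit signature:
    one bit per random hyperplane, set when the projection is positive."""
    planes = rng.normal(size=(n_planes, X.shape[1]))
    signs = (X @ planes.T) > 0
    buckets = defaultdict(list)
    for i, bits in enumerate(signs):
        buckets[bits.tobytes()].append(i)
    return planes, buckets

def candidates(q, planes, buckets):
    """Near-neighbor candidates for q = the points in q's bucket,
    instead of a linear scan over the whole dataset."""
    bits = (planes @ q) > 0
    return buckets.get(bits.tobytes(), [])

# Toy data: two well-separated clusters in 16 dimensions.
X = np.vstack([rng.normal(+2, 0.05, (10, 16)),
               rng.normal(-2, 0.05, (10, 16))])
planes, buckets = lsh_buckets(X)
print(candidates(X[0], planes, buckets))  # indices hashed with X[0]
```

In practice several independent hash tables are combined to boost recall; the point of the sketch is only that candidate generation touches one bucket rather than all n instances.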