DNA Sequence Assembly and Multiple Sequence Alignment by an Eulerian Path Approach

Zhang, Y; Waterman, Michael S.

doi:10.1101/sqb.2003.68.205

Cited by 6 publications

(4 citation statements)

References 8 publications


“…Although de Bruijn graph approaches are currently being used primarily for the purposes of assembly, they are a generally useful formalism for sequence analysis. In particular, they have been extended to efficient multiple-sequence alignment, repeat discovery, and detection of local and structural sequence variation (29, 32-34).…”

Section: Discussion (mentioning)

confidence: 99%

Scaling metagenome sequence assembly with probabilistic de Bruijn graphs

Pell, Hintze, Canino-Koning, et al. 2012

Proc. Natl. Acad. Sci. U.S.A.


Deep sequencing has enabled the investigation of a wide range of environmental microbial ecosystems, but the high memory requirements for de novo assembly of short-read shotgun sequencing data from these complex populations are an increasingly large practical barrier. Here we introduce a memory-efficient graph representation with which we can analyze the k-mer connectivity of metagenomic samples. The graph representation is based on a probabilistic data structure, a Bloom filter, that allows us to efficiently store assembly graphs in as little as 4 bits per k-mer, albeit inexactly. We show that this data structure accurately represents DNA assembly graphs in low memory. We apply this data structure to the problem of partitioning assembly graphs into components as a prelude to assembly, and show that this reduces the overall memory requirements for de novo assembly of metagenomes. On one soil metagenome assembly, this approach achieves a nearly 40-fold decrease in the maximum memory requirements for assembly. This probabilistic graph representation is a significant theoretical advance in storing assembly graphs and also yields immediate leverage on metagenomic assembly.

Keywords: metagenomics | compression

De novo assembly of shotgun sequencing reads into longer contiguous sequences plays an important role in virtually all genomic research (1). However, current computational methods for sequence assembly do not scale well to the volume of sequencing data now readily available from next-generation sequencing machines (1, 2). In particular, the deep sequencing required to sample complex microbial environments easily results in datasets that surpass the working memory of available computers (3, 4).

Deep sequencing and assembly of short reads is particularly important for the sequencing and analysis of complex microbial ecosystems, which can contain millions of different microbial species (5, 6). These ecosystems mediate important biogeochemical processes but are still poorly understood at a molecular level, in large part because they consist of many microbes that cannot be cultured or studied individually in the lab (5, 7). Ensemble sequencing ("metagenomics") of these complex environments is one of the few ways to render them accessible, and has resulted in substantial early progress in understanding the microbial composition and function of the ocean, human gut, cow rumen, and permafrost soil (3, 4, 8, 9). However, as sequencing capacity grows, the assembly of sequences from these complex samples has become increasingly computationally challenging. Current methods for short-read assembly rely on inexact data reduction in which reads from low-abundance organisms are discarded, biasing analyses towards high-abundance organisms (3, 4, 9).

The predominant assembly formalism applied to short-read sequencing datasets is a de Bruijn graph (10-12). In a de Bruijn graph approach, sequencing reads are decomposed into fixed-length words, or k-mers, and used to build a connectivity graph. This graph is then traversed to determine contiguous...
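The idea in the abstract above, storing k-mers in a Bloom filter and recovering graph edges by querying for neighbouring k-mers, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class `KmerBloomFilter`, the helpers `kmers`, `build_filter` and `successors`, the SHA-256-based hashing, and the default sizes are all assumptions chosen for readability, and reverse complements are ignored.

```python
import hashlib


class KmerBloomFilter:
    """Toy Bloom filter over k-mers (hypothetical helper, not from the paper)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive num_hashes bit positions by salting SHA-256 with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        # May report false positives, never false negatives.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(read, k):
    """Decompose a read into its overlapping fixed-length words."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_filter(reads, k, num_bits=1 << 16, num_hashes=4):
    bloom = KmerBloomFilter(num_bits, num_hashes)
    for read in reads:
        for kmer in kmers(read, k):
            bloom.add(kmer)
    return bloom


def successors(bloom, kmer):
    """Edges are implicit: try each one-base extension and query the filter."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

Because the graph is never stored explicitly, memory use depends only on the filter size; the cost is that a false positive in the filter shows up as a spurious node or edge during traversal.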


“…The process of assembly is then achieved by finding the Eulerian path in the de Bruijn graph. [8][9][10] Because of the benefit from de bruijn graph, in recent years, many software programs have been developed to merging the short reads fragment assembly in order to form the original sequences based on this approach as well. The most popular problems in sequencing technologies are to handle large amount of data produced as well as usually these short reads of DNA fragments have many overlaps.…”

Section: Introduction (mentioning)

confidence: 99%

“…Eulerian path is path which visits every edge exactly once. [2,10,13,14] A. Graph Construction The design methodology of this research is shown in figure below. It is realized by derivation of Verilog HDL code while the development is done in Xilinx ISE Design Suite 14.2.…”

Section: Introduction (mentioning)

confidence: 99%

Design and development of DNA fragment assembly using IWP method

Hassan, Majid, Halim, et al. 2013

2013 IEEE 4th Control and System Graduate Research Colloquium


Researchers continue to improve algorithms for DNA fragment assembly that use de Bruijn graphs. The de Bruijn graph is a graph-theoretical approach based on short words (k-mers) that is well suited to high-coverage, very short read (25-50 bp) data sets. This paper proposes the development of DNA fragment assembly using a method that applies the de Bruijn graph to construct a complete sequence, the Idury-Waterman and Pevzner (IWP) method. The algorithm was developed in Verilog HDL using Xilinx ISE Design Suite 14.2 and simulated with the Synopsys VCS tool. The simulation results agree with the theoretical analysis and are presented in this paper.
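As a software illustration of the Eulerian-path formulation mentioned in the citation statements above (a path that visits every edge exactly once), here is a minimal Python sketch. It is not the Verilog design described in the abstract: the function names `de_bruijn_edges`, `eulerian_path` and `spell` are hypothetical, Hierholzer's algorithm is swapped in as one standard way to find such a path, and the input is assumed to be error-free, non-empty, and to admit an Eulerian path.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Each distinct k-mer is one edge from its (k-1)-prefix to its (k-1)-suffix."""
    seen = set()
    graph = defaultdict(list)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph


def eulerian_path(graph):
    """Hierholzer's algorithm; assumes an Eulerian path exists in the graph."""
    out_deg = {u: len(vs) for u, vs in graph.items()}
    in_deg = defaultdict(int)
    for vs in graph.values():
        for v in vs:
            in_deg[v] += 1
    nodes = set(out_deg) | set(in_deg)
    # Start at the node with one more outgoing than incoming edge, if any;
    # otherwise the graph has an Eulerian cycle and any node with edges works.
    start = next((n for n in nodes
                  if out_deg.get(n, 0) - in_deg.get(n, 0) == 1), None)
    if start is None:
        start = next(iter(graph))
    remaining = {u: list(vs) for u, vs in graph.items()}
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if remaining.get(u):
            stack.append(remaining[u].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit node
    return path[::-1]


def spell(path):
    """Reconstruct the sequence spelled by consecutive overlapping nodes."""
    return path[0] + "".join(node[-1] for node in path[1:])


reads = ["ACGTC", "GTCAA"]
path = eulerian_path(de_bruijn_edges(reads, 3))
print(spell(path))
```

With real reads, sequencing errors and repeats make the graph far messier than in this toy input, so practical assemblers add error correction and repeat handling before any traversal.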


“…Networks are now an important tool in the field of bioinformatics as well. Their applications include, among others, gene and protein networks (Goldberg et al, 2007), the use of tree-shaped networks for the study of evolution (Huson and Bryant, 2006), the study of protein structures (Amitai et al, 2004), and sequence alignment (Zhang and Waterman, 2003).…”

Section: Figure 11, example of an undirected, unweighted network with 6 vertices (unclassified)

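Development of Hybrid Algorithms Based on Network Methodologies for Investigating Associations in Biological/Environmental Data (original title: Ανάπτυξη Υβριδικών Αλγορίθμων Με Βάση Μεθοδολογίες Δικτύων Για Τη Διερεύνηση Συσχετίσεων Σε Βιολογικά/Περιβαλλοντικά Δεδομένα)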

Βαλαβάνης


This dissertation proposes and develops network-based algorithms for processing and analyzing biological/environmental data, with the main aim of investigating associations within them. Specifically, the algorithms developed are used to analyze and process (i) protein data, with the aim of analyzing the space of structures and sequences and contributing to protein fold recognition, and (ii) data derived from individuals' genetic identity and environmental parameters, with the aim of causal analysis of multifactorial phenotypes related to cardiovascular diseases. In the first part of the dissertation, basic network principles are used to study the topology of protein similarity networks at the structure and sequence level. At the sequence level, the similarity networks are constructed using the distance between feature vectors extracted from the sequence, whereas at the structure level they are constructed using the similarity score obtained from structural alignment. The results of the network analysis are linked to evolutionary information about the proteins, and the information contained in the sequence-derived features is evaluated in relation to protein structure. Based on the sequence-level similarity network, a classifier is constructed that computes the relatedness of a protein sequence to sequences of known fold and is used for fold recognition. The second part of the work concerns the identification of factors (sex, age, genetic polymorphisms, clinical measurements, and dietary habits) that interact and jointly affect the risk of developing cardiovascular diseases. Two different available datasets are analyzed, in which the quantification of risk is based on the phenotypes of postprandial lipemia and obesity, respectively.
The methodology developed is based on the use of artificial neural networks in combination with backward feature selection and a genetic algorithm for selecting the important factors and their combinations. Applying the hybrid methods led to the identification of the optimal subsets of factors affecting the phenotypes under study, as well as to corresponding artificial neural network classifiers with satisfactory generalization ability on unseen data.
