Review of General Algorithmic Features for Genome Assemblers for Next Generation Sequencers

Wajid, Bilal; Serpedin, Erchin

doi:10.1016/j.gpb.2012.05.006

Cited by 38 publications

(31 citation statements)

References 66 publications


“…The layout helps in producing a consensus sequence, where each base in the sequence is identified by simple majority amongst the bases at that position or via some probabilistic approach. Therefore, this “Alignment-Layout-Consensus” paradigm is used by genome assemblers to infer the novel genome, [27-35]. …”

Section: Methods (mentioning)

confidence: 99%
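
The "simple majority" consensus step described in the statement above can be illustrated with a minimal sketch. It assumes the layout is already given as equal-length rows on a shared coordinate system, with `-` marking positions a read does not cover. The function name `majority_consensus`, the `N` placeholder for uncovered columns, and the alphabetical tie-break are illustrative assumptions, not taken from the cited papers.

```python
from collections import Counter


def majority_consensus(layout, gap="-"):
    """Call one consensus base per column by simple majority vote.

    `layout` is a list of equal-length strings, one per read, already
    placed on a common coordinate system (an assumption of this sketch).
    """
    if not layout:
        return ""
    width = len(layout[0])
    if any(len(row) != width for row in layout):
        raise ValueError("all rows in the layout must have the same length")

    consensus = []
    for column in zip(*layout):
        # Count only real bases; gap characters carry no vote.
        counts = Counter(base for base in column if base != gap)
        if not counts:
            consensus.append("N")  # no read covers this position
        else:
            # Highest count wins; ties are broken alphabetically.
            best = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))[0]
            consensus.append(best)
    return "".join(consensus)


layout = [
    "ACGTAC----",
    "--GTACGT--",
    "----ATGTTA",
]
print(majority_consensus(layout))
```

In the toy layout the third read disagrees with the other two at one column, and the majority base is kept. A probabilistic caller, which the statement also mentions, would replace the vote with a per-base likelihood.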

“…It begins the process by identifying a model, the “reference sequences”, most closely related to the set of reads. It then uses the set of reads to build on this model producing a model which overfits the data, the “novel genome”, [27,28,34,36-41]. The task of MDL is to identify the model that best describes the data and within comparative assembly framework the same meaning applies to finding the reference sequences that best describes the set of reads.…”

Section: Methods (mentioning)

confidence: 99%


Optimal reference sequence selection for genome assembly using minimum description length principle

Wajid, Serpedin, Nounou, et al. 2012. J Bioinform Sys Biology

Self Cite


Reference-assisted assembly uses a reference sequence as a model to assist in the assembly of the novel genome. The standard method for identifying the best reference sequence counts the number of reads that align to each candidate reference and chooses the candidate with the highest count. This article explores the use of the minimum description length (MDL) principle and two of its variants, two-part MDL and Sophisticated MDL, in identifying the optimal reference sequence for genome assembly. The article compares the proposed MDL-based scheme with the standard method and concludes that “counting the number of reads of the novel genome present in the reference sequence” is not a sufficient condition. The proposed MDL scheme therefore subsumes the standard method of “counting the number of reads that align to the reference sequence” and additionally evaluates the model, that is, the reference sequence itself, when identifying the optimal reference. The proposed MDL-based scheme not only provides a sufficient criterion for identifying the optimal reference sequence for genome assembly but also improves the reference sequence so that it becomes more suitable for the assembly of the novel genome.
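
The baseline that the abstract calls the standard method (count aligning reads per candidate, keep the candidate with the highest count) can be sketched as follows. This is only the counting baseline, not the MDL scheme the article proposes. Exact substring matching stands in for a real read aligner, and the names `count_aligned_reads`, `pick_reference_by_read_count`, `refA`, and `refB` are hypothetical.

```python
def count_aligned_reads(reference, reads):
    """Count reads that occur in the reference.

    Assumption: exact substring matching replaces a real aligner,
    so mismatches and indels are not tolerated in this sketch.
    """
    return sum(1 for read in reads if read in reference)


def pick_reference_by_read_count(references, reads):
    """Return the name of the reference with the most aligned reads.

    `references` maps a name to a sequence string. Ties go to the
    alphabetically first name so the result is deterministic.
    """
    counts = {
        name: count_aligned_reads(seq, reads)
        for name, seq in references.items()
    }
    best = max(sorted(counts), key=counts.get)
    return best, counts


references = {
    "refA": "ACGTACGTTAGC",
    "refB": "ACGTTTGTTAGC",
}
reads = ["ACGTAC", "GTACGT", "GTTAGC", "CGTTAG"]

best, counts = pick_reference_by_read_count(references, reads)
print(best, counts)
```

The score depends only on the reads and ignores properties of the reference itself, which is the limitation the abstract attributes to the counting criterion.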




“…For an extensive literature on assemblers consult [54,69-73]. The list of aligners is updated online [53].…”

Section: Platform-specific Biases (mentioning)

confidence: 99%

Computational Errors and Biases in Short Read Next Generation Sequencing

Abnizova, Boekhorst, Orlov 2017. J Proteomics Bioinform


“…N-gram based models have been widely used in natural language processing [11-13] and bioinformatics [14,15] due to their performance and ease of implementation. In this study, we only use uni-gram features and bi-gram features.…”

Section: B Using N-gram Models To Learn Associations Between (mentioning)

confidence: 99%
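
A minimal sketch of the uni-gram and bi-gram features mentioned in the statement above, assuming the input is already a list of tokens. The function names and the toy tokens are hypothetical, and the citing paper's actual feature pipeline is not shown in the excerpt.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a token sequence."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


def unigram_bigram_features(tokens):
    """Combine uni-gram and bi-gram counts into one feature bag."""
    features = Counter()
    features.update(ngram_counts(tokens, 1))
    features.update(ngram_counts(tokens, 2))
    return features


tokens = ["a", "b", "a", "b"]  # toy tokens, not real data
features = unigram_bigram_features(tokens)
print(features)
```

Keys of length one are the uni-grams and keys of length two are the bi-grams, so both feature types can share a single sparse vector.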

DeepDeath: Learning to predict the underlying cause of death with Big Data

Hassanzadeh, Sha, Wang 2017. 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)


Abstract: Multiple cause-of-death data provide a valuable source of information that can be used to enhance health standards by predicting health-related trajectories in societies with large populations. These data are often available in large quantities across U.S. states and require Big Data techniques to uncover complex hidden patterns. We design two different classes of models suitable for large-scale analysis of mortality data: a Hadoop-based ensemble of random forests trained over N-grams, and DeepDeath, a deep classifier based on the recurrent neural network (RNN). We apply both classes to the mortality data provided by the National Center for Health Statistics and show that, while both perform significantly better than the random classifier, the deep model, which uses long short-term memory networks (LSTMs), surpasses the N-gram based models and is capable of learning the temporal aspect of the data without the need to build ad hoc, expert-driven features.
