A modularized MapReduce framework to support RNA secondary structure prediction and analysis workflows

Zhang, Boyu; Yehdego, Daniel T.; Johnson, Kyle L.; Leung, Ming-Ying; Taufer, Michela

doi:10.1109/bibmw.2012.6470251

Cited by 2 publications

(4 citation statements)

References 13 publications

“…The optimized method tends to cut sequences into fewer chunks, which leads to fewer map tasks and shorter MapReduce total times. This observation is different from the previous work in which the centered method results in shorter execution times due to the same reason we mentioned above [17].…”

Section: Mar Correlation Coefficients (R) and P-values (P) (contrasting)

confidence: 91%

“…To the best of our knowledge, our work is the first one to adapt MapReduce into secondary structure predictions of long RNA sequences. Preliminary work on the reasoning behind adapting RNA secondary structure predictions to the MapReduce paradigm can be found at [17].…”

Section: MapReduce and Hadoop (mentioning)

confidence: 99%

“…RNA sequences. Preliminary work on the reasoning behind adapting RNA secondary structure predictions to the MapReduce paradigm can be found at [17].…”

Section: Introduction (mentioning)

confidence: 99%

Enhancement of accuracy and efficiency for RNA secondary structure prediction by sequence segmentation and MapReduce

et al. 2013

Background: Ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Their secondary structures are crucial for the RNA functionality, and the prediction of the secondary structures is widely studied. Our previous research shows that cutting long sequences into shorter chunks, predicting secondary structures of the chunks independently using thermodynamic methods, and reconstructing the entire secondary structure from the predicted chunk structures can yield better accuracy than predicting the secondary structure using the RNA sequence as a whole. The chunking, prediction, and reconstruction processes can use different methods and parameters, some of which produce more accurate predictions than others. In this paper, we study the prediction accuracy and efficiency of three different chunking methods using seven popular secondary structure prediction programs that apply to two datasets of RNA with known secondary structures, which include both pseudoknotted and non-pseudoknotted sequences, as well as a family of viral genome RNAs whose structures have not been predicted before. Our modularized MapReduce framework based on Hadoop allows us to study the problem in a parallel and robust environment.

Results: On average, the maximum accuracy retention values are larger than one for our chunking methods and the seven prediction programs over 50 non-pseudoknotted sequences, meaning that the secondary structure predicted using chunking is more similar to the real structure than the secondary structure predicted by using the whole sequence. We observe similar results for the 23 pseudoknotted sequences, except for the NUPACK program using the centered chunking method. The performance analysis for 14 long RNA sequences from the Nodaviridae virus family outlines how the coarse-grained mapping of chunking and predictions in the MapReduce framework exhibits shorter turnaround times for short RNA sequences. However, as the lengths of the RNA sequences increase, the fine-grained mapping can surpass the coarse-grained mapping in performance.

Conclusions: By using our MapReduce framework together with statistical analysis on the accuracy retention results, we observe how the inversion-based chunking methods can outperform predictions using the whole sequence. Our chunk-based approach also enables us to predict secondary structures for very long RNA sequences, which is not feasible with traditional methods alone.
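The chunk, predict, and reconstruct steps described in this abstract can be illustrated with a minimal in-memory sketch. It is not the authors' Hadoop framework: `chunk_fixed` is a fixed-length stand-in for the paper's chunking methods, `toy_predict` is a hypothetical placeholder for a real prediction program, and all function names are assumptions made for illustration. The only point shown is that each chunk can be predicted independently (the map step) and the per-chunk base pairs shifted back to whole-sequence coordinates (the reduce step).

```python
from typing import Callable, List, Tuple

Pair = Tuple[int, int]  # 0-based (i, j) positions of a base pair, with i < j

# Watson-Crick and G-U wobble pairs accepted by the toy predictor (assumption).
COMPLEMENT = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def chunk_fixed(seq: str, size: int) -> List[Tuple[int, str]]:
    """Stand-in chunker: cut seq into consecutive pieces of at most `size` bases.

    Returns (start_offset, chunk) tuples so pairs can be mapped back later.
    """
    return [(start, seq[start:start + size]) for start in range(0, len(seq), size)]


def toy_predict(chunk: str, min_loop: int = 3) -> List[Pair]:
    """Hypothetical placeholder predictor: pair the two ends of the chunk
    inward for as long as they are complementary, leaving a loop of more
    than `min_loop` positions between the innermost pair."""
    pairs: List[Pair] = []
    i, j = 0, len(chunk) - 1
    while j - i > min_loop and (chunk[i], chunk[j]) in COMPLEMENT:
        pairs.append((i, j))
        i, j = i + 1, j - 1
    return pairs


def map_phase(chunks: List[Tuple[int, str]],
              predict: Callable[[str], List[Pair]]) -> List[Tuple[int, List[Pair]]]:
    """Map step: predict every chunk independently of the others."""
    return [(start, predict(chunk)) for start, chunk in chunks]


def reduce_phase(mapped: List[Tuple[int, List[Pair]]]) -> List[Pair]:
    """Reduce step: shift chunk-local pairs to whole-sequence coordinates."""
    whole: List[Pair] = []
    for start, pairs in sorted(mapped):
        whole.extend((start + i, start + j) for i, j in pairs)
    return whole


def to_dot_bracket(length: int, pairs: List[Pair]) -> str:
    """Render base pairs as a dot-bracket string (no pseudoknots assumed)."""
    symbols = ["."] * length
    for i, j in pairs:
        symbols[i], symbols[j] = "(", ")"
    return "".join(symbols)


sequence = "GGGAAAACCCGCGAAAACGC"
structure = reduce_phase(map_phase(chunk_fixed(sequence, 10), toy_predict))
print(to_dot_bracket(len(sequence), structure))
```

Because the map step touches one chunk at a time, the list comprehension in `map_phase` is the part that a MapReduce runtime could distribute across workers; how finely that work is split is the coarse-grained versus fine-grained trade-off the abstract reports on.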

“…To the best of our knowledge, this work is the first one to adapt MR into secondary structure predictions of long RNA sequences. Preliminary work on the reasoning behind adapting RNA secondary structure predictions to the MR paradigm can be found at [13]. …”

Section: Background and Related Work (mentioning)

confidence: 99%

Secondary Structure Predictions for Long RNA Sequences Based on Inversion Excursions and MapReduce

Yehdego, Zhang, Kodimala, et al. 2013

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and PhD Forum

Secondary structures of ribonucleic acid (RNA) molecules play important roles in many biological processes including gene expression and regulation. Experimental observations and computing limitations suggest that we can approach the secondary structure prediction problem for long RNA sequences by segmenting them into shorter chunks, predicting the secondary structure of each chunk individually using existing prediction programs, and then assembling the results to give the structure of the original sequence. The selection of cutting points is a crucial component of the segmenting step. Noting that stem-loops and pseudoknots always contain an inversion, i.e., a stretch of nucleotides followed closely by its inverse complementary sequence, we developed two cutting methods for segmenting long RNA sequences based on inversion excursions: the centered and the optimized methods. Each step of searching for inversions, chunking, and predictions can be performed in parallel. In this paper we use a MapReduce framework, i.e., Hadoop, to extensively explore meaningful inversion stem lengths and gap sizes for the segmentation and identify correlations between chunking methods and prediction accuracy. We show that for a set of long RNA sequences in the RFAM database, whose secondary structures are known to contain pseudoknots, our approach predicts secondary structures more accurately than methods that do not segment the sequence, when the latter predictions are computationally possible. We also show that, as sequences exceed certain lengths, some programs cannot computationally predict pseudoknots while our chunking methods can. Overall, our predicted structures still retain the accuracy level of the original prediction programs when compared with known experimental secondary structures.
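The definition of an inversion given in this abstract (a stretch of nucleotides followed closely by its inverse complementary sequence) can be made concrete with a brute-force search. The sketch below is an illustration under stated assumptions, not the paper's inversion-excursion method: the names `find_inversions`, `stem_len`, and `max_gap` are hypothetical, only Watson-Crick pairs over the alphabet A, C, G, U are considered, and the centered and optimized cutting rules are not reproduced here.

```python
from typing import List, Tuple

# Watson-Crick complements only; G-U wobble pairs are ignored (assumption).
WC = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(stem: str) -> str:
    """Return the inverse complementary sequence of an RNA stretch."""
    return "".join(WC[base] for base in reversed(stem))


def find_inversions(seq: str, stem_len: int, max_gap: int) -> List[Tuple[int, int]]:
    """Brute-force inversion search.

    Returns (left_start, right_start) for every stretch of `stem_len` bases
    whose reverse complement begins at most `max_gap` bases after the
    stretch ends. Input is assumed to contain only A, C, G, and U.
    """
    hits: List[Tuple[int, int]] = []
    n = len(seq)
    for left in range(n - 2 * stem_len + 1):
        target = reverse_complement(seq[left:left + stem_len])
        for gap in range(max_gap + 1):
            right = left + stem_len + gap
            if right + stem_len > n:
                break  # the right arm would run past the end of the sequence
            if seq[right:right + stem_len] == target:
                hits.append((left, right))
    return hits


hairpin = "GGGAAAACCC"
print(find_inversions(hairpin, stem_len=3, max_gap=4))
```

Each starting position is examined independently of the others, which is consistent with the abstract's remark that the inversion search can be performed in parallel; varying `stem_len` and `max_gap` corresponds to the stem-length and gap-size exploration it mentions.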
