Parallelization of the Trinity Pipeline for De Novo Transcriptome Assembly

Sachdeva, Vipin; Kim, C. S.; Jordan, Kirk E.; Winn, Martyn

doi:10.1109/ipdpsw.2014.67

Cited by 6 publications

(5 citation statements)

References 19 publications

“…Inchworm initially creates a large hashmap table to store all unique k-mers from the input RNA-seq reads, and then it selects k-mers from the hashmap to construct linear contigs using a greedy k-mer extension approach. In our previous study [28], we confirmed that the Inchworm module of Trinity requires relatively high physical memory usage.…”

Section: Introduction (supporting)

confidence: 72%
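
The citation statement above describes Inchworm as building a hash table of k-mers and then extending contigs greedily. The following is a minimal Python sketch of that general idea, not Trinity's actual implementation: the function names (`count_kmers`, `best_neighbor`, `assemble`) are hypothetical, and the sketch assumes k ≥ 2 and ignores reverse complements, sequencing-error filtering, and memory optimizations.

```python
from collections import Counter


def count_kmers(reads, k):
    """Build the k-mer table: every k-mer in the reads mapped to its count."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def best_neighbor(counts, used, candidates):
    """Return the most abundant unused candidate k-mer, or None if there is none."""
    available = [km for km in candidates if km in counts and km not in used]
    return max(available, key=lambda km: counts[km]) if available else None


def assemble(reads, k):
    """Greedy k-mer extension (illustrative only; assumes k >= 2)."""
    counts = count_kmers(reads, k)
    used = set()
    contigs = []
    # Visit seeds from most to least abundant; ties broken alphabetically.
    for seed in sorted(counts, key=lambda km: (-counts[km], km)):
        if seed in used:
            continue
        used.add(seed)
        contig = seed
        # Extend to the right one base at a time.
        while True:
            suffix = contig[-(k - 1):]
            nxt = best_neighbor(counts, used, [suffix + b for b in "ACGT"])
            if nxt is None:
                break
            used.add(nxt)
            contig += nxt[-1]
        # Extend to the left one base at a time.
        while True:
            prefix = contig[:k - 1]
            prv = best_neighbor(counts, used, [b + prefix for b in "ACGT"])
            if prv is None:
                break
            used.add(prv)
            contig = prv[0] + contig
        contigs.append(contig)
    return contigs


if __name__ == "__main__":
    # Two overlapping toy reads that should merge into a single contig.
    print(assemble(["ATGGCGT", "GGCGTGC"], 4))
```

Because the whole k-mer table has to be held in memory before any contig can be built, a sketch like this also hints at why the step is memory-intensive on large read sets.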

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

Kim, Winn, Sachdeva, et al. 2017

Preprint

Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequence at once, resulting in the need for computer hardware with large shared memory. Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements, when making use of a compute cluster. Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines.
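
The abstract above does not spell out how the k-mers are clustered, so the following single-process Python sketch is only one plausible reading: it assumes that two k-mers belong to the same cluster when they are linked by a (k-1)-base overlap, and it mimics the map, shuffle, and reduce phases in plain Python. All function names are hypothetical, and a real MapReduce run would spread the shuffled keys across compute nodes instead of using a local dictionary and union-find.

```python
from collections import defaultdict


def map_kmer(kmer):
    """Map phase: key each k-mer by its (k-1)-base suffix and prefix."""
    yield kmer[1:], ("pred", kmer)   # kmer can precede k-mers starting with its suffix
    yield kmer[:-1], ("succ", kmer)  # kmer can follow k-mers ending with its prefix


def shuffle(pairs):
    """Shuffle phase: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_edges(values):
    """Reduce phase: emit an edge for every predecessor/successor pair on one key."""
    preds = [km for role, km in values if role == "pred"]
    succs = [km for role, km in values if role == "succ"]
    for p in preds:
        for s in succs:
            yield p, s


def cluster_kmers(kmers):
    """Group k-mers into connected components of the overlap graph."""
    kmers = sorted(set(kmers))
    parent = {km: km for km in kmers}

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    pairs = (pair for km in kmers for pair in map_kmer(km))
    for values in shuffle(pairs).values():
        for a, b in reduce_edges(values):
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    clusters = defaultdict(list)
    for km in kmers:
        clusters[find(km)].append(km)
    return sorted(sorted(members) for members in clusters.values())


if __name__ == "__main__":
    # Two unrelated toy "transcripts" should end up in separate clusters.
    print(cluster_kmers(["ATGG", "TGGC", "GGCG", "CCTA", "CTAA"]))
```

Each resulting cluster could then be handed to an independent assembly job, which is the property the abstract relies on to reduce per-node memory.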

“…Contigs from all clusters are pooled together and passed to the Chrysalis module for re-clustering according to the original Trinity scheme. The Inchworm module of Trinity is known to be the most memory-intensive step [28], and is often a barrier to processing large or complex RNA-Seq datasets. In our scheme, the computational load is passed to the pre-clustering step, where the well-established MapReduce procedure allows the load to be distributed over a commodity compute cluster.…”

Section: Discussion (mentioning)

confidence: 99%

“…Butterfly then reconstructs the full-length transcripts based on the de Bruijn graphs from Chrysalis, taking into account possible alternative splicing. In our previous study [28], we identified the Chrysalis module as the main bottleneck in terms of runtime, and alleviated this bottleneck by parallelising the processing over multiple compute nodes using MPI. We also confirmed that the Inchworm module of Trinity requires relatively high physical memory usage.…”

Section: Introduction (mentioning)

confidence: 99%

“…Using this library, we have developed software that can cluster k-mers, and then launch multiple Inchworm jobs for the resulting sub-graphs. The procedure can be linked with the rest of the Trinity pipeline, for selected components of which we have also developed an MPI-based parallelisation [28], so that the entire assembly workflow can be run on a commodity cluster. Use of the MapReduce-MPI software library [34] means that specialised MapReduce installations such as Hadoop are not required.…”

Section: Introduction (mentioning)

confidence: 99%

K-mer clustering algorithm using a MapReduce framework: application to the parallelization of the Inchworm module of Trinity

et al. 2017

Background: De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers using the de Bruijn graph of a set of the RNA sequences rely on in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequence at once, resulting in the need for computer hardware with large shared memory. Results: We introduce a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. We implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, we have re-implemented the Inchworm module from the widely used Trinity pipeline, and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in the runtime and per-node memory requirements, when making use of a compute cluster. Conclusions: Our study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although we have focussed on the Trinity package, we propose that such clustering is a useful initial step for other assembly pipelines. Electronic supplementary material: The online version of this article (10.1186/s12859-017-1881-8) contains supplementary material, which is available to authorized users.

“…Due to enormous volume of the data, transcriptome assembly is complex and requires a lot of computational time and resources e.g. only 10's of GB of data can take days to compute a transcriptome assembly [23] and can easily reach peta-byte level [24]. These NGS datasets have the inherent problems of storage and transmission due to their large volume and velocity.…”

Section: A Big NGS Data and Computational Challenges (mentioning)

confidence: 99%

Big data proteogenomics and high performance computing: Challenges and opportunities

Saeed, 2015

2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Proteogenomics is an emerging field of systems biology research at the intersection of proteomics and genomics. Two high-throughput technologies, Mass Spectrometry (MS) for proteomics and Next Generation Sequencing (NGS) machines for genomics are required to conduct proteogenomics studies. Independently both MS and NGS technologies are inflicted with data deluge which creates problems of storage, transfer, analysis and visualization. Integrating these big data sets (NGS+MS) for proteogenomics studies compounds all of the associated computational problems. Existing sequential algorithms for these proteogenomics datasets analysis are inadequate for big data and high performance computing (HPC) solutions are almost non-existent. The purpose of this paper is to introduce the big data problem of proteogenomics and the associated challenges in analyzing, storing and transferring these data sets. Further, opportunities for high performance computing research community are identified and possible future directions are discussed.
