A high performance multiple sequence alignment system for pyrosequencing reads from multiple reference genomes

Saeed, Fahad; Perez-Rathke, Alan; Gwarnicki, Jaroslaw; Berger-Wolf, Tanya Y.; Khokhar, Ashfaq

doi:10.1016/j.jpdc.2011.08.001

Cited by 11 publications

(9 citation statements)

References 48 publications

“…Then, we monitor several performance metrics, such as the memory consumption, the amount of network communications, and the computational overhead of the alignment algorithms. Previous work showed that providing high alignment performance is critical for an alignment algorithm to be adopted by the bio-medical community [41].…”

Section: A Experimental Settings (mentioning)

confidence: 99%

MaskAl: Privacy Preserving Masked Reads Alignment using Intel SGX

Lambert, Fernandes, Decouchant, et al. 2018

2018 IEEE 37th Symposium on Reliable Distributed Systems (SRDS)

The recent introduction of new DNA sequencing techniques caused the amount of processed and stored biological data to skyrocket. In order to process these vast amounts of data, bio-centers have been tempted to use low-cost public clouds. However, genomes are privacy sensitive, since they store personal information about their donors, such as their identity, disease risks, heredity and ethnic origin. The first critical DNA processing step that can be executed in a cloud, i.e., read alignment, consists in finding the location of the DNA sequences produced by a sequencing machine in the human genome. While recent developments aim at increasing performance, only few approaches address the need for fast and privacy preserving read alignment methods. This paper introduces MaskAl, a novel approach for read alignment. MaskAl combines a fast preprocessing step on raw genomic data (filtering and masking) with established algorithms to align sanitized reads, from which sensitive parts have been masked out, and refines the alignment score using the masked out information with Intel's software guard extensions (SGX). MaskAl is a highly competitive privacy-preserving read alignment software that can be massively parallelized with public clouds and emerging enclave clouds. Finally, MaskAl is nearly as accurate as plain-text approaches (more than 96% of aligned reads with MaskAl compared to 98% with BWA) and can process alignment workloads 87% faster than current privacy-preserving approaches while using less memory and network bandwidth.

“…This gives rise to the field of proteogenomics. The most effective and high-throughput tools for studying genomics and proteomics are next generation sequencing machines (NGS) [18] and mass spectrometers (MS) [19], respectively. Proteogenomics requires integration and analysis of data from both of these high-throughput technologies.…”

Section: Background Information (mentioning)

confidence: 99%

“…These machines produce short fragments of DNA or RNA sequences called reads. The sheer volume of data from these machines (3 billion DNA/RNA reads and 0.6TB per run [21]) needs efficient and high-performance computational tools [22] [18]. In order to process the genomic data it is usually mapped to the reference genome.…”

Section: A Big NGS Data and Computational Challenges (mentioning)

confidence: 99%

Big data proteogenomics and high performance computing: Challenges and opportunities

Saeed 2015

2015 IEEE Global Conference on Signal and Information Processing (GlobalSIP)

Self Cite

Proteogenomics is an emerging field of systems biology research at the intersection of proteomics and genomics. Two high-throughput technologies, Mass Spectrometry (MS) for proteomics and Next Generation Sequencing (NGS) machines for genomics are required to conduct proteogenomics studies. Independently both MS and NGS technologies are inflicted with data deluge which creates problems of storage, transfer, analysis and visualization. Integrating these big data sets (NGS+MS) for proteogenomics studies compounds all of the associated computational problems. Existing sequential algorithms for these proteogenomics datasets analysis are inadequate for big data and high performance computing (HPC) solutions are almost non-existent. The purpose of this paper is to introduce the big data problem of proteogenomics and the associated challenges in analyzing, storing and transferring these data sets. Further, opportunities for high performance computing research community are identified and possible future directions are discussed.

“…3) Load Balancing: Load balancing is one of the most important attributes necessary for performance of a parallel algorithm [25], [26]. Load balancing is important because it ensures that the processors/cores are busy for most of the time the program is running.…”

Section: Parallelizing the Main Loop in Algorithm (mentioning)

confidence: 99%

A high performance algorithm for clustering of large-scale protein mass spectrometry data using multi-core architectures

Saeed, Hoffert, Knepper 2013

Proceedings of the 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining

Self Cite

High-throughput mass spectrometers can produce thousands of peptide spectra from a single complex protein sample in a short amount of time. These data sets contain a substantial amount of redundancy (i.e. the same peptide is selected and identified multiple times in a single experiment) from peptides that may get selected multiple times in the liquid chromatography mass spectrometry (LC-MS/MS) experiment. The data from these mass spectrometers contain a substantial number of spectra that have low signal to noise (S/N) ratio and may not get interpreted due to poor quality. Recently, we presented a graph theoretic algorithm, CAMS (Clustering Algorithm for Mass Spectra) for clustering mass spectrometry data. CAMS utilized a novel metric, called an F-set, that allows identification of similar spectra with much higher accuracy and sensitivity than if single peak comparisons were performed. In this paper we present a multithreaded algorithm, called P-CAMS, for clustering of mass spectral data on multicore machines. The algorithm relies on intelligent matrix completion for graph construction and a load-balancing scheme for substantial speedups. We study the scalability performance of the proposed parallel algorithm on a multicore machine using synthetically generated spectra with parameters carefully chosen to mimic real-world mass spectrometry datasets. Real experimental datasets were also generated for quality assessment of the clustering results from the proposed algorithm. The results show that the proposed algorithms have scalable runtime performance and give clustering results similar to a serial algorithm. The study also provides insight into the design of high performance algorithms for irregular problems in proteomics on many-core architectures.
