Current genome sequencing techniques give the service provider (e.g., a sequencing company) full access to the genome information, which violates the privacy of individuals regarding their lifetime secret. In this paper, we introduce the problem of private DNA sequencing, where the goal is to keep the DNA sequence private from the sequencer. We propose an architecture in which the task of reading fragments of DNA and the task of DNA assembly are separated: the former is done at the sequencer(s), and the latter is completed at a local trusted data collector. To satisfy the privacy constraint at the sequencer and the reconstruction condition at the data collector, we create an information gap between the two using two techniques: (i) we use more than one non-colluding sequencer, all reporting the read fragments to the single data collector, and (ii) we add to the pool the fragments of some known DNA molecules, which remain unknown to the sequencers. We prove that these two techniques provide enough freedom to satisfy both conditions at the same time.
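The data flow in this architecture can be illustrated with a toy sketch. It is not the paper's construction and carries no privacy guarantee. It only shows fragments of a target sequence being mixed with fragments of known sequences, dealt out to several sequencers, and filtered at the data collector. The function names (`make_reads`, `build_pool`, `collect`), the fixed read length, and the round-robin split are all assumptions made for illustration, and the assembly step itself is omitted.

```python
from collections import Counter
import random


def make_reads(seq, read_len, step):
    """Cut a sequence into overlapping fixed-length fragments (a toy model of reads)."""
    return [seq[i:i + read_len] for i in range(0, len(seq) - read_len + 1, step)]


def build_pool(target, known_seqs, read_len, step, n_sequencers, rng):
    """Mix target fragments with fragments of known sequences and split the pool.

    Returns one batch per (hypothetical) sequencer, plus the list of known
    fragments, which only the data collector keeps.
    """
    target_reads = make_reads(target, read_len, step)
    known_reads = [r for s in known_seqs for r in make_reads(s, read_len, step)]
    pool = target_reads + known_reads
    rng.shuffle(pool)
    # Round-robin split: each sequencer sees only part of the shuffled pool.
    batches = [pool[i::n_sequencers] for i in range(n_sequencers)]
    return batches, known_reads


def collect(batches, known_reads):
    """Data collector: merge all reported reads and remove the known fragments.

    Multiset subtraction keeps a target read even if it happens to be
    identical to a known fragment.
    """
    reported = Counter(r for batch in batches for r in batch)
    return reported - Counter(known_reads)
```

In this sketch, the collector recovers exactly the multiset of target fragments, while each sequencer holds only a shuffled subset of a pool that it cannot separate into target and known material by inspection of the batch alone. Whether such a split actually satisfies a formal privacy constraint depends on the analysis in the paper, which this example does not reproduce.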
Biobanks store and catalog human biological material that is increasingly being digitized using next-generation sequencing (NGS). There is, however, a computational bottleneck, as existing software systems are not scalable and secure enough to store and process the incoming wave of genomic data from NGS machines. In the BiobankCloud project, we are building a Hadoop-based platform for the secure storage, sharing, and parallel processing of genomic data. We extended Hadoop to include support for multi-tenant studies, reduced storage requirements with erasure coding, and added support for extensible and consistent metadata. On top of Hadoop, we built a scalable scientific workflow engine featuring a proper workflow definition language focusing on simple integration and chaining of existing tools, adaptive scheduling on Apache YARN, and support for iterative dataflows. Our platform also supports the secure sharing of data across different, distributed Hadoop clusters. The software is easily installed and comes with a user-friendly web interface for running, managing, and accessing data sets behind secure two-factor authentication. Initial tests have shown that the engine scales well to dozens of nodes. The entire system is open-source and includes pre-defined workflows for popular tasks in biomedical data analysis, such as variant identification, differential transcriptome analysis using RNA-Seq, and analysis of miRNA-Seq and ChIP-Seq data.
Next-generation sequencing (NGS) machines produce an increasing amount of data, which demands scalable storage and analysis of genomic data. To cope with this huge amount of information, many biobanks are interested in cloud computing capabilities such as on-demand elasticity of computing power and storage capacity. However, several security and privacy requirements mandated by personal data protection legislation hinder biobanks from migrating the big data generated by NGS machines to the cloud. This paper describes the privacy requirements of platform-as-a-service BiobankClouds according to the European Data Protection Directive (DPD). It identifies several key privacy threats that leave BiobankClouds vulnerable to attack. This study benefits health-care application designers in the requirement elicitation cycle when building privacy-preserving BiobankCloud platforms.