Synthetic DNA is durable and can encode digital data with high density, making it an attractive medium for data storage. However, recovering stored data on a large-scale currently requires all the DNA in a pool to be sequenced, even if only a subset of the information needs to be extracted. Here, we encode and store 35 distinct files (over 200 MB of data), in more than 13 million DNA oligonucleotides, and show that we can recover each file individually and with no errors, using a random access approach. We design and validate a large library of primers that enable individual recovery of all files stored within the DNA. We also develop an algorithm that greatly reduces the sequencing read coverage required for error-free decoding by maximizing information from all sequence reads. These advances demonstrate a viable, large-scale system for DNA data storage and retrieval.
Current storage technologies can no longer keep pace with exponentially growing amounts of data.1 Synthetic DNA offers an attractive alternative due to its potential information density of ~ 10 18 B/mm 3 , 10 7 times denser than magnetic tape, and potential durability of thousands of years.2 Recent advances in DNA data storage have highlighted technical challenges, in particular, coding and random access, but have stored only modest amounts of data in synthetic DNA. 3,4,5 This paper demonstrates an end-to-end approach toward the viability of DNA data storage with large-scale random access. We encoded and stored 35 distinct files, totaling 200MB of data, in more than 13 million DNA oligonucleotides (about 2 billion nucleotides in total) and fully recovered the data with no bit errors, representing an advance of almost an order of magnitude compared to prior work. 6 Our data curation focused on technologically advanced data types and historical relevance, including the Universal Declaration of Human Rights in over 100 languages, 7 a high-definition music video of the band OK Go, 8 and a CropTrust database of the seeds stored in the Svalbard Global Seed Vault. 9 We developed a random access methodology based on selective amplification, for which we designed and validated a large library of primers, and successfully retrieved arbitrarily chosen items from a subset of our pool containing 10.3 million DNA sequences. Moreover, we developed a novel coding scheme that dramatically reduces the physical redundancy (sequencing read coverage) required for error-free decoding to a median of 5x, while maintaining levels of logical redundancy comparable to the best prior codes. We further stress-tested our coding approach by successfully decoding a file using the more error-prone nanopore-based sequencing. We provide a detailed analysis of errors in the process of writing, storing, and reading data from synthetic DNA at a large scale, which helps peer-reviewed) is the author/funder. All rights reserved. No reuse allowed without permission.The copyright holder for this preprint (which was not . http://dx.doi.org/10.1101/114553 doi: bioRxiv preprint first posted online Mar. 7, 2017; 2 characterize DNA as a storage medium and justify our coding approach. Thus, we have demonstrated a significant improvement in data volume, random access, and encoding/decoding schemes that contribute to a whole-system vision for DNA data storage.Storing digital data using synthetic DNA encompasses mapping bits into nucleotide sequences, synthesizing the corresponding molecules, and storing them in an appropriate environment.Reading the information requires sequencing and converting the stored DNA back into digital data. Our project explores this DNA data storage workflow end-to-end (Figure 1a). We focus on scaling up data volumes and solving associated key challenges. Specifically, we address the need to access data selectively rather than in bulk and the need to minimize the amount of sequencing required to completely recover stored data.Most pr...
DNA has recently emerged as an attractive medium for future digital data storage because of its extremely high information density and potential longevity. Recent work has shown promising results in developing proof-of-principle prototype systems. However, very uneven (biased) sequencing coverage distributions have been reported, which indicates inefficiencies in the storage process and points to optimization opportunities. These deviations from the average coverage in oligonucleotide copy distribution result in sequence drop-out and make error-free data retrieval from DNA more challenging. The uneven copy distribution was believed to stem from the underlying molecular processes, but the interplay between these molecular processes and the copy number distribution has been poorly understood until now. In this paper, we use millions of unique sequences from a DNA-based digital data archival system to study the oligonucleotide copy unevenness problem and show that two important sources of bias are the synthesis process and the Polymerase Chain Reaction (PCR) process. By mapping the sequencing coverage of a large complex oligonucleotide pool back to its spatial distribution on the synthesis chip, we find that significant bias comes from array-based oligonucleotide synthesis. We also find that PCR stochasticity is another main driver of oligonucleotide copy variation. Based on these findings, we develop a statistical model for each molecular process as well as the overall process and compare the predicted bias with our experimental data. We further use our model to explore the trade-offs between synthesis bias, storage physical density and sequencing redundancy, providing insights for engineering efficient, robust DNA data storage systems.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.