MESA: automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing and PCR errors

Schwarz, Michael; Welzel, Marius; Kabdullayeva, Tolganay; Becker, Anke; Freisleben, Bernd; Heider, Dominik

doi:10.1093/bioinformatics/btaa140

Cited by 32 publications

(47 citation statements)

References 23 publications

“…Next, we compare several classifiers (Fig. 3) including KRAKEN2, SINTAX, IDTAXA, the naïve Bayesian classifier implemented in DADA2 (DADA2-NBC), and the naïve Bayes scikit-learn classifier implemented in QIIME2 (QIIME2-NB) for their ability in accurately annotating query sequences in simQS-V3V4-i to simQS-V3V4-iii, simulated short-read data sets generated by introducing realistic error rates (∼1%) to bee-associated V3-V4 sequences (randomly sampled from the parent database BEEx-FL-refs during in silico PCR) using established Mosla Error Simulator (MESA) software (56) (see Materials and Methods section for more details).…”

Section: Results (mentioning)

confidence: 99%

“…Benchmarks performed on error-free sequence queries derived from an identical database as is being used to classify the queries is expected to result in unrealistically inflated performance rates (1). To enable more realistic testing conditions during experiments, error rates of approximately ∼1% were introduced to the sequence representatives derived from BEEx-FL-refs using established Mosla Error Simulator (MESA) software (56). Briefly, the ErrASE synthesis method was chosen with the default sequencing method set for paired-end Illumina MiSeq alongside a standard 30-cycle traditional PCR amplification step and a 12-month sample storage period.…”

Section: Methods (mentioning)

confidence: 99%

BEExact: a Metataxonomic Database Tool for High-Resolution Inference of Bee-Associated Microbial Communities

Daisley

Reid

2021

mSystems

High-throughput 16S rRNA gene sequencing technologies have robust potential to improve our understanding of bee (Hymenoptera: Apoidea)-associated microbial communities and their impact on hive health and disease. Despite recent computation algorithms now permitting exact inferencing of high-resolution exact amplicon sequence variants (ASVs), the taxonomic classification of these ASVs remains a challenge due to inadequate reference databases. To address this, we assemble a comprehensive data set of all publicly available bee-associated 16S rRNA gene sequences, systematically annotate poorly resolved identities via inclusion of 618 placeholder labels for uncultivated microbial dark matter, and correct for phylogenetic inconsistencies using a complementary set of distance-based and maximum likelihood correction strategies. To benchmark the resultant database (BEExact), we compare performance against all existing reference databases in silico using a variety of classifier algorithms to produce probabilistic confidence scores. We also validate realistic classification rates on an independent set of ∼234 million short-read sequences derived from 32 studies encompassing 50 different bee types (36 eusocial and 14 solitary). Species-level classification rates on short-read ASVs range from 80 to 90% using BEExact (with ∼20% due to “bxid” placeholder names), whereas only ∼30% at best can be resolved with current universal databases. A series of data-driven recommendations are developed for future studies. We conclude that BEExact (https://github.com/bdaisley/BEExact) enables accurate and standardized microbiota profiling across a broad range of bee species—two factors of key importance to reproducibility and meaningful knowledge exchange within the scientific community that together, can enhance the overall utility and ecological relevance of routine 16S rRNA gene-based sequencing endeavors. 
IMPORTANCE The failure of current universal taxonomic databases to support the rapidly expanding field of bee microbiota research has led to many investigators relying on “in-house” reference sets or manual classification of sequence reads (usually based on BLAST searches), often with vague identity thresholds and subjective taxonomy choices. This time-consuming, error- and bias-prone process lacks standardization, cripples the potential for comparative cross-study analysis, and in many cases is likely to incorrectly sway study conclusions. BEExact is structured on and leverages several complementary bioinformatic techniques to enable refined inference of bee host-associated microbial communities without any other methodological modifications necessary. It also bridges the gap between current practical outcomes (i.e., phylotype-to-genus level constraints with 97% operational taxonomic units [OTUs]) and the theoretical resolution (i.e., species-to-strain level classification with 100% ASVs) attainable in future microbiota investigations. Other niche habitats could also likely benefit from customized database curation via implementation of the novel approaches introduced in this study.

“…The highest customizability is obtained by using the MESA [32] API. Since MESA as a web tool for the automated assessment of synthetic DNA fragments and simulation of DNA synthesis, storage, sequencing, and PCR errors does not only allow user-defined configurations but also offers a REST-API, MESA allows a fine-grained and correct assessment of error probabilities per packet.…”

Section: Methods (mentioning)

confidence: 99%

NOREC4DNA: using near-optimal rateless erasure codes for DNA storage

Schwarz

Freisleben

2021

BMC Bioinformatics

Self Cite

Background: DNA is a promising storage medium for high-density long-term digital data storage. Since DNA synthesis and sequencing are still relatively expensive tasks, the coding methods used to store digital data in DNA should correct errors and avoid unstable or error-prone DNA sequences. Near-optimal rateless erasure codes, also called fountain codes, are particularly interesting codes to realize high-capacity and low-error DNA storage systems, as shown by Erlich and Zielinski in their approach based on the Luby transform (LT) code. Since LT is the most basic fountain code, there is a large untapped potential for improvement in using near-optimal erasure codes for DNA storage.
Results: We present NOREC4DNA, a software framework to use, test, compare, and improve near-optimal rateless erasure codes (NORECs) for DNA storage systems. These codes can effectively be used to store digital information in DNA and cope with the restrictions of the DNA medium. Additionally, they can adapt to possible variable lengths of DNA strands and have nearly zero overhead. We describe the design and implementation of NOREC4DNA. Furthermore, we present experimental results demonstrating that NOREC4DNA can flexibly be used to evaluate the use of NORECs in DNA storage systems. In particular, we show that NORECs that apparently have not yet been used for DNA storage, such as Raptor and Online codes, can achieve significant improvements over LT codes that were used in previous work. NOREC4DNA is available on https://github.com/umr-ds/NOREC4DNA.
Conclusion: NOREC4DNA is a flexible and extensible software framework for using, evaluating, and comparing NORECs for DNA storage systems.

“…Deoxyribonucleic acid sequences containing consecutive repetitive subsequences are more likely to be misaligned during sequencing and this results in data-reading errors (Myers, 2007 Tandem Repeats and Morphological Variation | Learn Science at Scitable). Sequences containing consecutive repetitive subsequences easily produce polymerase slippage at the synthesis phase (Schwarz et al, 2020). Two DNA sequences can easily become dislocated in the repetitive region.…”

Section: Non-adjacent Subsequence (mentioning)

confidence: 99%

“…Therefore, it is vital to study the sources of errors that impact DNA storage and coding. Earlier studies (Myers, 2007 Tandem Repeats and Morphological Variation | Learn Science at Scitable; Kovacevic and Tan, 2018; Schwarz et al, 2020) revealed that the error rate in the storage process increases if there are consecutive repetitive subsequences in the sequence. Hence, we propose a novel constraint (non-adjacent subsequence constraint) to avoid the occurrence of this sequence.…”

Section: Introduction (mentioning)

confidence: 99%

CLGBO: An Algorithm for Constructing Highly Robust Coding Sets for DNA Storage

Zheng

Wang

2021

Front. Genet.

In the era of big data, new storage media are urgently needed because the storage capacity for global data cannot meet the exponential growth of information. Deoxyribonucleic acid (DNA) storage, where primer and address sequences play a crucial role, is one of the most promising storage media because of its high density, large capacity and durability. In this study, we describe an enhanced gradient-based optimizer that includes the Cauchy and Levy mutation strategy (CLGBO) to construct DNA coding sets, which are used as primer and address libraries. Our experimental results show that the lower bounds of DNA storage coding sets obtained using the CLGBO algorithm are increased by 4.3–13.5% compared with previous work. The non-adjacent subsequence constraint was introduced to reduce the error rate in the storage process. This helps to resolve the problem that arises when consecutive repetitive subsequences in the sequence cause errors in DNA storage. We made use of the CLGBO algorithm and the non-adjacent subsequence constraint to construct larger and more highly robust coding sets.
