Accuracy of de novo assembly of DNA sequences from double-digest libraries varies substantially among software

LaCava, Melanie E. F.; Aikens, Ellen O.; Megna, Libby C.; Randolph, Gregg; Hubbard, Charley J.; Buerkle, C. Alex

doi:10.1101/706531

Cited by 10 publications

(19 citation statements)

References 41 publications


“…CD‐HIT was developed for assembling protein sequences, but was later extended for nucleotide sequences, and it is used in the analysis pipeline dDocent (Puritz et al, ). We used dDocent's data reduction step that retains only one copy of each unique sequence for assembly to reduce computational time (this script can be found in the Dryad repository: https://doi.org/10.5061/dryad.8tr03f8, LaCava et al, ).…”

Section: Methods (mentioning)

confidence: 99%
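The data reduction step quoted above (keeping one copy of each unique sequence before assembly) can be illustrated with a minimal Python sketch. This is not dDocent's script; the function name `dereplicate` and the toy reads are hypothetical and only show the idea of collapsing identical reads while remembering how often each one occurred.

```python
from collections import Counter


def dereplicate(reads):
    """Collapse identical reads into (sequence, count) pairs.

    Hypothetical helper for illustration; not the dDocent implementation.
    Counter keeps keys in first-seen order (Python 3.7+).
    """
    counts = Counter(reads)
    return list(counts.items())


# Toy reads (made up for this example)
reads = ["ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC"]
unique = dereplicate(reads)

for seq, n in unique:
    print(seq, n)
```

Passing only the unique sequences to the assembler reduces the number of pairwise comparisons, which is the computational saving the citation statement refers to.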

“…We used indsdemultiplexing.py to demultiplex the simulated reads and readstrim.py to trim the barcodes from the reads, resulting in reads of 94 bp that all began with the EcoRI restriction site. Our wrapper functions for each simulation, modified from run_ddradseq_chain.sh in ddRADseqTools, are included in the Dryad repository: https://doi.org/10.5061/dryad.8tr03f8, LaCava et al ().…”

Section: Methods (mentioning)

confidence: 99%
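As a rough illustration of the demultiplexing and barcode-trimming steps described above, the following Python sketch assigns each read to an individual by its leading barcode and then removes that barcode. It is not the code of indsdemultiplexing.py or readstrim.py; the function name, barcodes, sample names, and reads are all hypothetical toy values.

```python
def demultiplex_and_trim(reads, barcodes):
    """Group reads by leading barcode and strip the barcode.

    `barcodes` maps barcode sequence -> sample name (hypothetical values).
    Reads that match no barcode are dropped in this sketch.
    """
    by_sample = {sample: [] for sample in barcodes.values()}
    for read in reads:
        for barcode, sample in barcodes.items():
            if read.startswith(barcode):
                # Keep only the part of the read after the barcode
                by_sample[sample].append(read[len(barcode):])
                break
    return by_sample


# Toy barcodes and reads (made up for this example)
barcodes = {"ACGT": "ind1", "TGCA": "ind2"}
reads = ["ACGTAATTCGG", "TGCAAATTCTT", "ACGTAATTCAA", "GGGGAATTCCC"]
trimmed = demultiplex_and_trim(reads, barcodes)
print(trimmed)
```

After trimming, every retained read begins at the same position relative to the cut site, which is what allows reads from the same locus to be compared directly during assembly.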

“…We constructed GBS assemblies using each assembler for the nine simulated data sets for A. thaliana and H. sapiens . We used a custom Perl script to compare the assembled contigs with the known fragments from the in‐silico digestion of genomes to determine assembly accuracy using two metrics (this script can be found in the Dryad repository: https://doi.org/10.5061/dryad.8tr03f8, LaCava et al, ). To evaluate how completely each assembler recovered the original genome fragments (loci), we counted the number of true genome fragments that were represented in the assembly (completeness criterion).…”

Section: Methods (mentioning)

confidence: 99%
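The completeness criterion described above (counting the true genome fragments that are represented in the assembly) can be sketched in Python as follows. The authors used a custom Perl script whose matching rule is not given in the quoted text, so this sketch assumes, purely for illustration, that a fragment counts as represented when some assembled contig is an exact substring of it. The function name and sequences are hypothetical.

```python
def completeness(true_fragments, contigs):
    """Count true fragments represented by at least one contig.

    Assumption for this sketch only: 'represented' means some contig
    is an exact substring of the fragment. The original Perl script
    may use a different matching rule.
    """
    represented = sum(
        1
        for frag in true_fragments
        if any(contig in frag for contig in contigs)
    )
    return represented, represented / len(true_fragments)


# Toy fragments and contigs (made up for this example)
true_fragments = ["AATTCGGATCCA", "AATTCTTTGGGA", "AATTCAAACCCT"]
contigs = ["AATTCGGATC", "AATTCAAACC", "GGGGGGGG"]

n_found, fraction = completeness(true_fragments, contigs)
print(n_found, round(fraction, 3))
```

In this toy case two of the three known fragments are matched, and the third contig corresponds to no true fragment, which is the kind of discrepancy an accuracy comparison against simulated data is meant to expose.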

“…Simulated reads, assembler outputs, and scripts for simulations, assembly, and analysis are available from the Dryad Digital Repository: https://doi.org/10.5061/dryad.8tr03f8 (LaCava et al, ).…”

Section: Data Availability Statement (mentioning)

confidence: 99%


Accuracy of de novo assembly of DNA sequences from double‐digest libraries varies substantially among software

LaCava, Aikens, Megna, et al. 2019

Molecular Ecology Resources

Self Cite


Advances in DNA sequencing have made it feasible to gather genomic data for non‐model organisms and large sets of individuals, often using methods for sequencing subsets of the genome. Several of these methods sequence DNA associated with endonuclease restriction sites (various RAD and GBS methods). For use in taxa without a reference genome, these methods rely on de novo assembly of fragments in the sequencing library. Many of the software options available for this application were originally developed for other assembly types and we do not know their accuracy for reduced representation libraries. To address this important knowledge gap, we simulated data from the Arabidopsis thaliana and Homo sapiens genomes and compared de novo assemblies by six software programs that are commonly used or promising for this purpose (ABySS, CD‐HIT, Stacks, Stacks2, Velvet and VSEARCH). We simulated different mutation rates and types of mutations, and then applied the six assemblers to the simulated data sets, varying assembly parameters. We found substantial variation in software performance across simulations and parameter settings. ABySS failed to recover any true genome fragments, and Velvet and VSEARCH performed poorly for most simulations. Stacks and Stacks2 produced accurate assemblies of simulations containing SNPs, but the addition of insertion and deletion mutations decreased their performance. CD‐HIT was the only assembler that consistently recovered a high proportion of true genome fragments. Here, we demonstrate the substantial difference in the accuracy of assemblies from different software programs and the importance of comparing assemblies that result from different parameter settings.


“…Despite many studies using RRL methods, the field lacks guidelines for best practices, and researchers have experimented with many different de novo assembly pipelines. A survey of 100 studies with de novo assembly of double‐digest RRLs published between February 2015 and October 2017 used 24 unique assembly pipelines (LaCava et al, ). That is approximately one new assembler per four similarly designed studies!…”

mentioning

confidence: 99%

Stacking up RADSeq assembly programs: From complete hit to completely abysmal

Marrano, Palmer, Moyers 2020

Molecular Ecology Resources


Decreasing sequencing costs have driven a rapid expansion of novel genotyping methods. One of these methods is the exploitation of restriction enzyme cut sites to generate genome‐wide but reduced representation sequencing libraries (RRLs), alternatively termed genotyping by sequencing or restriction‐site associated DNA sequencing. Without a reference genome, the resulting short sequence reads must be assembled de novo. There are many possible assembly programs, most not explicitly developed for RRL data, and we know little of their effectiveness. In this issue of Molecular Ecology Resources, LaCava et al. (2020) systematically evaluate six commonly used programs and two commonly varied parameters for complete and accurate assembly of RRLs, using simulated double digests of Homo sapiens and Arabidopsis thaliana genomes with varied mutation rates and types. The authors find substantial variation in performance across assembly programs. The most consistently high‐performing assembler is infrequently used in their literature survey (CD‐HIT; Li and Godzik, 2006), while several others fail to produce complete, accurate assemblies under many conditions. LaCava et al. additionally recommend best practices in parameter choice and evaluation of future assembly programs—advice that molecular ecologists working to assemble sequences of all kinds should take to heart.


Population genomic diversity and structure at the discontinuous southern range of the Great Gray Owl in North America

Mendelsohn, Bedrosian, Stowell, et al. 2020

Conserv Genet
