Debar: A sequence‐by‐sequence denoiser for COI‐5P DNA barcode data

Nugent, Cameron M.; Elliott, Tyler A.; Ratnasingham, Sujeevan; Hebert, Paul D. N.; Adamowicz, Sarah J.

doi:10.1111/1755-0998.13384

Cited by 2 publications (3 citation statements)

References 45 publications (87 reference statements)


“…Similar computational and statistical tools, in the form of MATLAB packages, R packages, Python packages and methodological pipelines, used to assess anomalies in DNA (meta)barcodes, have been released. Examples include divisive hierarchical clustering: DADA (Rosen et al 2012) and DADA2 (Callahan et al 2016); artificial neural networks: (Ma et al 2018); Profile Hidden Markov Models: coil (Nugent et al 2020), debar (Nugent et al 2021 and Porter and Hajibabaei 2021); distribution sample quantiles: MACER (Young et al 2021); and Shannon entropy: SequenceBouncer (Dunn 2021), A2G2 (Hleap et al 2020), DnoisE (Antich et al 2022 and Turon et al 2020). These methods and programmes are beginning to see widespread use within the biodiversity and regulatory science communities.…”

Section: Discussion

mentioning

confidence: 99%

VLF: An R package for the analysis of very low frequency variants in DNA sequences

Phillips, Athey, McNicholas et al. 2023

BDJ


Here, we introduce VLF, an R package to determine the distribution of very low frequency variants (VLFs) in nucleotide and amino acid sequences for the analysis of errors in DNA sequence records. The package allows users to assess VLFs in aligned and trimmed protein-coding sequences by automatically calculating the frequency of nucleotides or amino acids in each sequence position and outputting those that occur under a user-specified frequency (default of p = 0.001). These results can then be used to explore fundamental population genetic and phylogeographic patterns, mechanisms and processes at the microevolutionary level, such as nucleotide and amino acid sequence conservation. Our package extends earlier work pertaining to an implementation of VLF analysis in Microsoft Excel, which was found to be both computationally slow and error prone. We compare those results to our own herein. Results between the two implementations are found to be highly consistent for a large DNA barcode dataset of bird species. Differences in results are readily explained by both manual human error and inadequate Linnean taxonomy (specifically, species synonymy). Here, VLF is also applied to a subset of avian barcodes to assess the extent of biological artifacts at the species level for Canada goose (Branta canadensis), as well as within a large dataset of DNA barcodes for fishes of forensic and regulatory importance. The novelty of VLF and its benefit over the previous implementation include its high level of automation, speed, scalability and ease-of-use, each desirable characteristics which will be extremely valuable as more sequence data are rapidly accumulated in popular reference databases, such as BOLD and GenBank.
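The per-position frequency screen described in this abstract can be illustrated with a minimal Python sketch. This is not the VLF package's code or API (the package is written in R); the function name `find_vlfs`, its return format, and the treatment of every character (including gaps or ambiguity codes) as an ordinary symbol are assumptions made for illustration. Only the idea of flagging residues whose frequency at a position falls under a user-specified threshold, with a default of p = 0.001, comes from the abstract.

```python
from collections import Counter


def find_vlfs(sequences, p=0.001):
    """Flag residues whose frequency at an alignment position is below p.

    Hypothetical helper, not part of the VLF R package. Expects aligned,
    trimmed sequences of equal length and returns a list of tuples:
    (sequence index, position, residue, frequency).
    """
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("sequences must be aligned to the same length")

    n = len(sequences)
    hits = []
    for pos in range(length):
        # Count how often each residue occurs at this alignment column.
        counts = Counter(seq[pos] for seq in sequences)
        for idx, seq in enumerate(sequences):
            freq = counts[seq[pos]] / n
            # Assumption: "under" the threshold means strictly less than p.
            if freq < p:
                hits.append((idx, pos, seq[pos], freq))
    return hits


if __name__ == "__main__":
    # 1999 identical toy sequences plus one carrying a single rare variant.
    seqs = ["ACGT"] * 1999 + ["ACTT"]
    for idx, pos, residue, freq in find_vlfs(seqs):
        print(f"sequence {idx}, position {pos}: {residue} (frequency {freq})")
```

With the default threshold, a variant must occur in fewer than 1 in 1000 sequences to be reported, so the screen is only informative on alignments with well over a thousand records.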


“…Machine learning approaches also allow prediction of patterns of biodiversity at large geographical scales by facilitating the combination of genomic, ecological and geographical data in novel ways (Barrow et al, 2020). Finally, machine learning can improve our estimates of biodiversity by allowing efficient error correction of barcoding data sets (Nugent et al, 2021). Again, the diversity of ap-…”

Section: Biodiversity and Species Limits

mentioning

confidence: 99%

“…However, errors can result in inflated estimates of diversity when not corrected. To address this issue, Nugent et al (2021) introduce debar, an approach for denoising COI-5P DNA barcode data using machine learning. Debar uses a Profile Hidden Markov model (PHMM) to detect indel errors in COI barcoding data.…”

mentioning

confidence: 99%