The accuracy of specimen identification through DNA barcoding and metabarcoding relies on reference libraries containing records with reliable taxonomy and sequence quality. The considerable growth in barcode data requires stringent data curation, especially in taxonomically difficult groups such as marine invertebrates. A major effort in curating marine barcode data in the Barcode of Life Data Systems (BOLD) was undertaken during the 8th International Barcode of Life Conference (Trondheim, Norway, 2019). Major taxonomic groups (crustaceans, echinoderms, molluscs, and polychaetes) were reviewed to identify those which had disagreement between Linnaean names and Barcode Index Numbers (BINs). The records with disagreement were annotated with four tags: a) MIS-ID (misidentified, mislabeled, or contaminated records), b) AMBIG (ambiguous records unresolved with the existing data), c) COMPLEX (species names occurring in multiple BINs), and d) SHARE (barcodes shared between species). A total of 83,712 specimen records corresponding to 7,576 species were reviewed and 39% of the species were tagged (7% MIS-ID, 17% AMBIG, 14% COMPLEX, and 1% SHARE). High percentages (>50%) of AMBIG tags were recorded in gastropods, whereas COMPLEX tags dominated in crustaceans and polychaetes. The high proportion of tagged species reflects either flaws in the barcoding workflow (e.g., misidentification, cross-contamination) or taxonomic difficulties (e.g., synonyms, undescribed species). Although data curation is essential for barcode applications, such manual attempts to examine large datasets are unsustainable and automated solutions are extremely desirable.
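The two tags defined purely by name-to-BIN structure (COMPLEX and SHARE) lend themselves to the kind of automated pre-screening the abstract calls for. The following is a minimal sketch, not the authors' workflow: it assumes records are available as simple (species name, BIN) pairs, and the function name, species names, and BIN identifiers are made up for illustration. MIS-ID and AMBIG decisions require expert review and are not attempted here; a BIN holding several names is only a candidate for SHARE, since it could equally reflect a misidentification.

```python
from collections import defaultdict


def flag_discordance(records):
    """Flag candidate COMPLEX species and candidate SHARE BINs.

    records: iterable of (species_name, bin_id) pairs (hypothetical input format).
    Returns (complex_species, share_bins), both dicts with sorted lists as values.
    """
    bins_by_species = defaultdict(set)
    species_by_bin = defaultdict(set)
    for species, bin_id in records:
        bins_by_species[species].add(bin_id)
        species_by_bin[bin_id].add(species)

    # COMPLEX candidate: one species name occurring in more than one BIN.
    complex_species = {
        s: sorted(b) for s, b in bins_by_species.items() if len(b) > 1
    }
    # SHARE candidate: one BIN containing more than one species name.
    share_bins = {
        b: sorted(s) for b, s in species_by_bin.items() if len(s) > 1
    }
    return complex_species, share_bins


# Made-up records for illustration only.
records = [
    ("Species a", "BIN-0001"),
    ("Species a", "BIN-0002"),
    ("Species b", "BIN-0003"),
    ("Species c", "BIN-0003"),
    ("Species d", "BIN-0004"),
]
complex_species, share_bins = flag_discordance(records)
```

In this toy input, "Species a" is flagged as a COMPLEX candidate and "BIN-0003" as a SHARE candidate, while "Species d" is concordant and left untagged.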
Because DNA metabarcoding typically employs sequence diversity among mitochondrial amplicons to estimate species composition, nuclear mitochondrial pseudogenes (NUMTs) can inflate diversity. This study quantifies the incidence and attributes of NUMTs derived from the 658‐bp barcode region of cytochrome c oxidase I (COI) in 156 marine animal genomes. NUMTs were examined to ascertain if they could be recognized by their possession of indels or stop codons. In total, 309 NUMTs ≥150 bp were detected, with an average of 1.98 per species (range = 0–33) and a mean length of 391 ± 200 bp. Among this total, 75 (24.3%) lacked indels or stop codons. NUMTs appear to pose the greatest interpretational risk when short (<313 bp) amplicons are used, such as in environmental DNA studies, dietary analyses or processed fish identification. Employing the standard amplicon length (313 bp) for marine metabarcoding, NUMTs could potentially inflate the operational taxonomic unit (OTU) count by 21% above the true species count while also raising intraspecific variation at COI by 15%. However, when both amplicon length and position are considered, inflation in OTU counts and in barcode variation was just 9% and 10%, respectively, suggesting NUMTs will not seriously distort biodiversity assessments. There was a weak positive correlation between genome size and NUMT count but no variation among phyla or trophic groups. Until bioinformatic advances improve NUMT detection, the best defence involves targeting long amplicons and developing reference databases that include both mitochondrial sequences and their NUMT derivatives.
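The two diagnostic features mentioned above (stop codons and indels) can be screened for with very little code. This is a hedged sketch under stated assumptions, not the study's pipeline: the function names are hypothetical, the default stop set is that of NCBI translation table 5 (invertebrate mitochondrial: TAA and TAG), and the reading frame must be supplied by the caller. Vertebrate mitochondrial sequences use a different table with additional stop codons, so the `stops` argument would need changing for them. The indel check assumes the sequence has already been aligned to a reference, with `-` marking gaps.

```python
import re

# Stop codons under NCBI translation table 5 (invertebrate mitochondrial).
# Assumption: other taxa may need a different set passed via `stops`.
INVERTEBRATE_MT_STOPS = frozenset({"TAA", "TAG"})


def has_stop_codon(seq, frame=0, stops=INVERTEBRATE_MT_STOPS):
    """Return True if an ungapped sequence contains an in-frame stop codon.

    frame: offset (0, 1, or 2) of the first complete codon.
    """
    seq = seq.upper()
    for i in range(frame, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return True
    return False


def has_frameshift_gap(aligned_row):
    """Return True if an aligned row has an internal gap run whose length
    is not a multiple of three.

    Terminal gaps are ignored because they reflect partial coverage rather
    than an indel. Apply to the query row to detect deletions and to the
    reference row to detect insertions in the query.
    """
    core = aligned_row.strip("-")
    return any(len(m.group()) % 3 != 0 for m in re.finditer(r"-+", core))


def lacks_diagnostic_features(seq, aligned_query, aligned_ref, frame=0):
    """True when neither a stop codon nor a frameshifting gap is found."""
    return not (
        has_stop_codon(seq, frame)
        or has_frameshift_gap(aligned_query)
        or has_frameshift_gap(aligned_ref)
    )
```

A sequence that passes `lacks_diagnostic_features` is not thereby shown to be mitochondrial; it corresponds to the class of NUMTs the abstract describes as lacking indels or stop codons, which is why such screening alone cannot remove every NUMT.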
Because DNA metabarcoding typically employs sequence diversity among mitochondrial amplicons to estimate species composition, nuclear mitochondrial pseudogenes (NUMTs) can inflate diversity. This study quantifies the incidence and attributes of NUMTs derived from the 658 bp barcode region of cytochrome c oxidase I (COI) in 156 marine animal genomes. The number of NUMTs meeting four length criteria (>150 bp, >300 bp, >450 bp, >600 bp) was determined, and they were examined to ascertain if they could be recognized by their possession of indels or stop codons. In total, 389 NUMTs >100 bp were detected, with an average of 2.49 per species (range = 0–50) and a mean length of 336 ± 208 bp. Among NUMTs lacking diagnostic features, 52.5% were ≤300 bp, 63.9% were ≤450 bp, and 76.2% were ≤600 bp. Studies examining 150 bp amplicons inflate the OTU count by 1.57x compared to the true species count and increase perceived intraspecific variation at COI by 1.19x (when sequence variants with >2% sequence divergence are recognized as different OTUs). There was a weak positive correlation between genome size and NUMT count but no variation among phyla, trophic groups or life history traits. While bioinformatic advances will improve NUMT detection, the best defense involves targeting long amplicons and developing reference databases that include both mitochondrial sequences and their NUMT derivatives.
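To make the 2% rule concrete, the sketch below shows how a divergent NUMT can add an OTU for a species that is already represented by its true mitochondrial sequence. It is illustrative only and assumes things the abstract does not specify: sequences are pre-aligned and of equal length, divergence is the uncorrected p-distance, and OTUs are formed by a simple greedy pass against the first member of each cluster. The function names and toy sequences are hypothetical.

```python
def p_distance(a, b):
    """Uncorrected proportion of differing sites between two equal-length,
    pre-aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    if not a:
        raise ValueError("sequences must not be empty")
    diffs = sum(1 for x, y in zip(a, b) if x != y)
    return diffs / len(a)


def greedy_otus(seqs, threshold=0.02):
    """Assign each sequence to the first OTU whose seed is within the
    threshold; otherwise start a new OTU. Returns a list of OTU indices."""
    seeds = []
    assignments = []
    for s in seqs:
        for idx, seed in enumerate(seeds):
            if p_distance(s, seed) <= threshold:
                assignments.append(idx)
                break
        else:
            seeds.append(s)
            assignments.append(len(seeds) - 1)
    return assignments


# Toy 100-site sequences (made up): a mitochondrial haplotype, a variant
# differing at 1 site (1%), and a NUMT-like copy differing at 5 sites (5%).
mito = "A" * 100
variant = "C" + "A" * 99
numt_like = "C" * 5 + "A" * 95

assignments = greedy_otus([mito, variant, numt_like])
otu_count = len(set(assignments))
```

Here one species yields two OTUs because the NUMT-like copy exceeds the 2% threshold, which is the mechanism behind the inflation factors reported above. Greedy seeding is order dependent, so real pipelines generally use dedicated clustering tools instead.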