Successful implementation of environmental DNA (eDNA)-based diversity monitoring depends on the completeness and accuracy of the reference databases used for taxonomic assignment of eDNA sequences. Here, we have developed a workflow that evaluates the current status of GenBank for marine fishes.
Expectations are high regarding the potential of eDNA metabarcoding for diversity monitoring. To make this approach suitable for this purpose, the completeness and accuracy of reference databases used for taxonomic assignment of eDNA sequences are among the challenges to be tackled. Yet, despite ongoing efforts to increase coverage of reference databases, sequences for key species are lacking, and incorrect records in widely used repositories such as GenBank have been reported. This compromises eDNA metabarcoding studies, especially for highly diverse groups such as marine fishes. Here, we have developed a workflow that evaluates the completeness and accuracy of GenBank. For a given combination of species and barcodes, a gap analysis is performed, and potentially erroneous sequences are identified. Our gap analysis, based on the four most used genes for fish eDNA metabarcoding (cytochrome c oxidase subunit 1 [COI], 12S rRNA, 16S rRNA and cytochrome b), found that COI, the universal choice for metazoans, is the gene covering the highest number of Northeast Atlantic marine fishes (70%), while 12S rRNA, the preferred region for fish-targeting studies, only covered about 50% of the species. The presence of barcode sequences that are closer to, or more distant from, each other than expected from their taxonomic classification confirms the presence of erroneous sequences in GenBank that our workflow can detect and eliminate. Comparing taxonomic assignments of real marine eDNA samples with raw and clean reference databases for the most used 12S rRNA barcodes (teleo and MiFish), we found that the two barcodes perform differently, and demonstrated that the application of the database cleaning workflow can result in drastic changes in community composition.
Besides providing an automated tool for reference database curation, this study confirms the need to increase the number of 12S rRNA reference sequences for European marine fishes, encourages the use of a multi-marker approach for better community composition assessment, and highlights the dangers of making taxonomic assignments by directly querying GenBank.
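The gap analysis described above can be illustrated with a minimal sketch. It assumes the species checklist and the sequence records have already been retrieved and reduced to (species, marker) pairs; the function name `gap_analysis`, the input layout and the toy data are hypothetical and are not taken from the published workflow.

```python
from collections import defaultdict


def gap_analysis(checklist, records, markers):
    """Per-marker coverage of a species checklist.

    checklist: iterable of species names (hypothetical input format).
    records:   iterable of (species, marker) pairs, one per reference sequence.
    markers:   markers to report on, e.g. ["COI", "12S"].
    """
    species = set(checklist)
    covered = defaultdict(set)
    for sp, marker in records:
        # Ignore sequences from species outside the checklist or from other markers.
        if sp in species and marker in markers:
            covered[marker].add(sp)

    report = {}
    for marker in markers:
        n_covered = len(covered[marker])
        report[marker] = {
            "covered": n_covered,
            "missing": sorted(species - covered[marker]),
            "coverage": n_covered / len(species) if species else 0.0,
        }
    return report


# Toy data for illustration only; not real GenBank content.
checklist = ["Gadus morhua", "Solea solea", "Sardina pilchardus", "Merluccius merluccius"]
records = [
    ("Gadus morhua", "COI"),
    ("Gadus morhua", "12S"),
    ("Solea solea", "COI"),
    ("Solea solea", "12S"),
    ("Sardina pilchardus", "COI"),
    ("Homo sapiens", "COI"),  # not in the checklist, so it is ignored
]
report = gap_analysis(checklist, records, ["COI", "12S"])
for marker, stats in report.items():
    print(marker, f"{stats['coverage']:.0%}", "missing:", stats["missing"])
```

The list of missing species per marker is what would guide targeted sequencing efforts, which is why the sketch returns it alongside the coverage fraction.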
Environmental DNA (eDNA) metabarcoding, the process of sequencing DNA collected from the environment to produce biodiversity inventories, is increasingly being applied to assess fish diversity and distribution in marine environments. Yet, the successful application of this technique relies heavily on accurate and complete reference databases for taxonomic assignment. The most used markers for fish eDNA metabarcoding studies are the cytochrome c oxidase subunit 1 (COI), 16S ribosomal RNA (16S), 12S ribosomal RNA (12S) and cytochrome b (cyt b) genes, whose sequences are usually retrieved from GenBank, the largest DNA sequence database and a worldwide public resource for genetic studies. Thus, the completeness and accuracy of GenBank are critical to deriving reliable estimates from fish eDNA metabarcoding data. Here, we have i) compiled the checklist of European marine fishes, ii) performed a gap analysis of the four genes and, within COI and 12S, also of the most used barcodes for fish, and iii) developed a workflow to detect potentially incorrect records in GenBank. We found that of the 1965 species in the checklist (1761 Actinopterygii, 189 Elasmobranchii, 9 Holocephali, 4 Petromyzonti and 2 Myxini), about 70% have sequences for COI, whereas fewer have sequences for 12S, 16S and cyt b (45-55%). Among the species for which COI and 12S sequences are available, about 60% and 40%, respectively, have sequences covering the most used barcodes. The analysis of pairwise distances between sequences revealed pairs belonging to the same species with significantly low similarity and pairs belonging to different high-level taxonomic groups (class, order) with significantly high similarity. In light of this further confirmation of the presence of a substantial number of incorrect records in GenBank, we propose a method for identifying and removing spurious sequences to create reliable and accurate reference databases for eDNA metabarcoding.
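The pairwise comparison described above can be sketched as follows. This is only an illustration under stated assumptions: sequences are taken to be already aligned to equal length, similarity is the plain fraction of identical positions, and the two fixed cut-offs (`min_within_species`, `max_between_orders`) are hypothetical stand-ins for whatever significance criterion the authors applied. Only the order rank is checked here, although the text also mentions class.

```python
from itertools import combinations


def similarity(a, b):
    """Fraction of identical positions in two pre-aligned sequences.

    Positions where either sequence has a gap ('-') or an 'N' are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = matches = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        matches += x == y
    return matches / compared if compared else 0.0


def flag_pairs(records, min_within_species=0.90, max_between_orders=0.95):
    """Flag sequence pairs whose similarity contradicts their taxonomy.

    records: list of dicts with keys 'id', 'species', 'order', 'seq'
             (hypothetical layout). Thresholds are illustrative assumptions.
    """
    flagged = []
    for r1, r2 in combinations(records, 2):
        sim = similarity(r1["seq"], r2["seq"])
        if r1["species"] == r2["species"] and sim < min_within_species:
            # Same species, yet unexpectedly dissimilar.
            flagged.append((r1["id"], r2["id"], "too_distant", sim))
        elif r1["order"] != r2["order"] and sim > max_between_orders:
            # Different orders, yet unexpectedly similar.
            flagged.append((r1["id"], r2["id"], "too_close", sim))
    return flagged


# Toy records for illustration only; identifiers and sequences are made up.
records = [
    {"id": "A", "species": "Gadus morhua", "order": "Gadiformes", "seq": "ACGTACGTAC"},
    {"id": "B", "species": "Gadus morhua", "order": "Gadiformes", "seq": "ACGTACGTAC"},
    {"id": "C", "species": "Gadus morhua", "order": "Gadiformes", "seq": "TTTTACGTAC"},
    {"id": "D", "species": "Solea solea", "order": "Pleuronectiformes", "seq": "ACGTACGTAC"},
]
for id1, id2, reason, sim in flag_pairs(records):
    print(id1, id2, reason, round(sim, 2))
```

A flagged pair does not say which of its two records is wrong. A cleaning step would need a further rule, for example removing the record that appears in the most flagged pairs; that rule is again an assumption and not necessarily the published method.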