Abstract
Metabarcoding is a popular application which warrants continued methods optimization. To maximize barcoding inferences, hierarchy-based sequence classification methods are increasingly common. We present methods for the construction and curation of a database designed for hierarchical classification of a 157 bp barcoding region of the arthropod cytochrome c oxidase subunit I (COI) locus. We produced a comprehensive arthropod COI amplicon dataset including annotated arthropod COI sequences and COI sequences extracted from arthropod whole mitochondrion genomes, which provided the only source of representation for Zoraptera, Callipodida and Holothyrida. The database contains extracted sequences of the target amplicon from all major arthropod clades, including all insect orders, all arthropod classes and Onychophora, Tardigrada and Mollusca outgroups. During curation, we extracted the COI region of interest from approximately 81 percent of the input sequences, corresponding to 73 percent of the genus-level diversity found in the input data. Further, our analysis revealed a high degree of sequence redundancy within the NCBI nucleotide database, with a mean of approximately 11 sequence entries per species in the input data. The curated, low-redundancy database is included in the Metaxa2 sequence classification software (http://microbiology.se/software/metaxa2/). Using this database with the Metaxa2 classifier, we characterized the relationship between the Metaxa2 reliability score, an estimate of classification confidence, and classification error probability. We used this analysis to select a reliability score threshold which minimized error. We then estimated classification sensitivity, false discovery rate and overclassification, the propensity to classify sequences from taxa not represented in the reference database. Our work will help researchers design and evaluate classification databases and conduct metabarcoding on arthropods and alternate taxa.
Introduction
With the increasing availability of high-throughput DNA sequencing, scientists from a wide range of backgrounds and interests are increasingly using this technology to achieve a variety of goals. One growing area of interest involves the use of metabarcoding, or amplicon sequencing, for biomonitoring, biodiversity assessment and community composition inference (Yu et al. 2012; Guardiola et al. 2015; Richardson et al. 2015). Using universal primers designed to amplify conserved genomic regions across a broad diversity of taxonomic groups of interest, researchers are afforded the opportunity to survey biological communities at unprecedented scales. While such advancements hold great promise for improving our knowledge of the biological world, they also present new challenges to the scientific community.
Given that bioinformatic methods for taxonomic inference of metabarcoding sequence data are relatively new, the development, val...