We compared the classification accuracy of two sections of the fungal internal transcribed spacer (ITS) region, individually and combined, and the 5′ section (about 600 bp) of the large-subunit rRNA (LSU), using a naive Bayesian classifier and BLASTN. A hand-curated ITS-LSU training set of 1,091 sequences and a larger training set of 8,967 ITS region sequences were used. Of the factors evaluated, database composition and quality had the largest effect on classification accuracy, followed by fragment size and use of a bootstrap cutoff to improve classification confidence. The naive Bayesian classifier and BLASTN gave similar results at higher taxonomic levels, but the classifier was faster and more accurate at the genus level when a bootstrap cutoff was used. All of the ITS and LSU sections performed well (>97.7% accuracy) at higher taxonomic ranks from kingdom to family, and differences between them were small at the genus level (within 0.66 to 1.23%). When full-length sequence sections were used, the LSU outperformed the ITS1 and ITS2 fragments at the genus level, but the ITS1 and ITS2 showed higher accuracy when smaller fragment sizes of the same length and a 50% bootstrap cutoff were used. In a comparison using the larger ITS training set, ITS1 and ITS2 had very similar classification accuracy for fragments between 100 and 200 bp. Collectively, the results show that any of the ITS or LSU sections we tested provided comparable classification accuracy to the genus level and underscore the need for larger and more diverse classification training sets.
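To illustrate how a bootstrap cutoff can be combined with a naive Bayesian classifier, the following minimal Python sketch scores a query against per-genus word frequencies and then reports the fraction of bootstrap resamples that agree with the full-sequence assignment. It is not the implementation used in this study: the word size of 8, the 0.5 pseudocount, the one-eighth subsample, the 100 trials, and all function names are assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict

K = 8  # word size (assumption for this sketch)


def kmers(seq, k=K):
    """Return the sorted unique words of length k in seq."""
    return sorted({seq[i:i + k] for i in range(len(seq) - k + 1)})


def train(labelled):
    """labelled: iterable of (genus, sequence) pairs.

    Returns (counts, sizes): counts[genus][word] is the number of training
    sequences of that genus containing the word; sizes[genus] is the number
    of training sequences of that genus.
    """
    counts = defaultdict(lambda: defaultdict(int))
    sizes = defaultdict(int)
    for genus, seq in labelled:
        sizes[genus] += 1
        for word in kmers(seq):
            counts[genus][word] += 1
    return counts, sizes


def best_genus(words, counts, sizes):
    """Pick the genus with the highest naive Bayes log-likelihood."""
    def log_score(genus):
        n = sizes[genus]
        # Simplified smoothing (assumption): constant 0.5 pseudocount.
        return sum(
            math.log((counts[genus].get(w, 0) + 0.5) / (n + 1)) for w in words
        )
    return max(sizes, key=log_score)


def classify(seq, counts, sizes, cutoff=0.5, trials=100, seed=0):
    """Return (label, confidence); label is 'unclassified' below the cutoff."""
    words = kmers(seq)
    if not words:  # query shorter than the word size
        return "unclassified", 0.0
    rng = random.Random(seed)
    assigned = best_genus(words, counts, sizes)
    subset_size = max(1, len(words) // K)
    # Bootstrap: resample words with replacement and count agreeing trials.
    votes = sum(
        best_genus(rng.choices(words, k=subset_size), counts, sizes) == assigned
        for _ in range(trials)
    )
    confidence = votes / trials
    return (assigned if confidence >= cutoff else "unclassified"), confidence
```

Calling `classify(query, *train(training_pairs), cutoff=0.5)` mirrors the 50% bootstrap cutoff mentioned above: an assignment is kept only when at least half of the resampled word subsets support the same genus.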
Fungi are one of the most diverse groups of eukaryotic organisms on Earth, with estimates that range from 1.5 to 5.1 million species (1, 2). The use of next-generation sequencing (NGS) is playing a major role in the discovery of new species and ecological studies of fungi. Large molecular data sets are being generated at an extraordinary rate (3-6), but diversity estimations and taxonomic identification at all taxonomic levels are constrained by the lack of accurate, comprehensive taxonomic databases and information on the accuracy of classification tools for comparison of environmental survey data. The detection of emergent fungal diseases, the determination of biogeographical patterns, and definition of strategies for conservation of fungi are just a few examples of research areas that are challenged by the lack of reliable databases and tools (7, 8). The large number of sequences generated by high-throughput sequencing platforms also demands fast and accurate algorithms for sequence analysis and taxonomic classification of fungi.
The entire internal transcribed spacer (ITS) rRNA region (approximately 600 bp in length) is composed of two hypervariable regions (ITS1 and ITS2) with the highly conserved 5.8S rRNA gene between them (Fig. 1). The ITS region has been used for many years for diversity estimations and taxonomic identification of fungal isolates and uncultured taxa (9-11) and was adopted as the barcode region for Fungi by the Consortium for the Barcode ...