Recently diverged species are challenging to identify, yet they are frequently of special interest scientifically as well as from a regulatory perspective. DNA barcoding has proven instrumental in species identification, especially in insects and vertebrates, but for the identification of recently diverged species it has been reported to be problematic in some cases. Problems are mostly due to incomplete lineage sorting or simply the lack of a ‘barcode gap’, and are probably related to large effective population size and/or low mutation rate. Our objective was to compare six methods in their ability to correctly identify recently diverged species with DNA barcodes: neighbor joining and parsimony (both tree-based), nearest neighbor and BLAST (similarity-based), and the diagnostic methods DNA-BAR and BLOG. We analyzed simulated data assuming three different effective population sizes as well as three selected empirical data sets from published studies. Results show, as expected, that success rates are significantly lower for recently diverged species (∼75%) than for older species (∼97%) (P<0.00001). Similarity-based and diagnostic methods significantly outperform tree-based methods when applied to simulated DNA barcode data (P<0.00001). The diagnostic method BLOG had the highest correct query identification rate based on simulated (86.2%) as well as empirical data (93.1%), indicating that it is a consistently better method overall. Another advantage of BLOG is that it offers species-level information that can be used outside the realm of DNA barcoding, for instance in species description or molecular detection assays. Even though we can confirm that identification success based on DNA barcoding is generally high in our data, recently diverged species remain difficult to identify. Nevertheless, our results contribute to improved solutions for their accurate identification.
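As a rough illustration of the similarity-based family mentioned above, the following Python sketch assigns a query barcode to the species of its closest reference sequence by uncorrected p-distance. It is a minimal sketch under stated assumptions, not the procedure used in the study: the sequences are assumed to be pre-aligned and of equal length, the function names `p_distance` and `nearest_neighbor_identify` are hypothetical, and the toy reference sequences are invented for illustration. A tie between different species is reported as ambiguous, which is the kind of failure one might expect when no barcode gap exists.

```python
def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Compare only sites where both sequences carry an unambiguous base.
    pairs = [(x, y) for x, y in zip(a.upper(), b.upper())
             if x in "ACGT" and y in "ACGT"]
    if not pairs:
        return 1.0  # nothing comparable: treat as maximally distant
    return sum(x != y for x, y in pairs) / len(pairs)


def nearest_neighbor_identify(query, reference):
    """Return (species, distance) for the closest reference sequence.

    `reference` is a list of (species, sequence) pairs. If the minimum
    distance is shared by sequences of different species, the species
    is returned as None to flag an ambiguous identification.
    """
    scored = [(p_distance(query, seq), species) for species, seq in reference]
    best = min(d for d, _ in scored)
    candidates = {species for d, species in scored if d == best}
    species = candidates.pop() if len(candidates) == 1 else None
    return species, best


# Invented toy reference library (hypothetical data).
reference = [
    ("Species A", "ACGTACGTAC"),
    ("Species A", "ACGTACGTAT"),
    ("Species B", "ACGTTCGTAC"),
    ("Species B", "ACGTTCGAAC"),
]

print(nearest_neighbor_identify("ACGTACGTAT", reference))  # unique closest match
print(nearest_neighbor_identify("ACGTTCGTAT", reference))  # equidistant to both species
```

In this toy library the second query is equally close to a sequence of each species, so the sketch declines to name a species rather than pick one arbitrarily.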
Background: Specific fragments, coming from short portions of DNA (e.g., mitochondrial, nuclear, and plastid sequences), have been defined as DNA Barcodes and can be used as markers for organisms of the main life kingdoms. Species classification with DNA Barcode sequences has been proven effective on different organisms. Indeed, specific gene regions have been identified as Barcodes: COI in animals, rbcL and matK in plants, and ITS in fungi. The classification problem consists of assigning an unknown specimen to a known species by analyzing its Barcode. This task has to be supported with reliable methods and algorithms.
Methods: In this work the efficacy of supervised machine learning methods to classify species with DNA Barcode sequences is shown. The Weka software suite, which includes a collection of supervised classification methods, is adopted to address the task of DNA Barcode analysis. Classifier families are tested on synthetic and empirical datasets belonging to the animal, fungus, and plant kingdoms. In particular, the function-based method Support Vector Machines (SVM), the rule-based RIPPER, the decision tree C4.5, and the Naïve Bayes method are considered. Additionally, the classification results are compared with those of ad-hoc and well-established DNA Barcode classification methods.
Results: Software that converts DNA Barcode FASTA sequences to the Weka format is released, to adapt different input formats and to allow the execution of the classification procedure. The analysis of results on synthetic and real datasets shows that SVM and Naïve Bayes outperform, on average, the other considered classifiers, although they do not provide a human-interpretable classification model. Rule-based methods have slightly inferior classification performance, but deliver the species-specific positions and nucleotide assignments. On synthetic data the supervised machine learning methods obtain superior classification performance with respect to the traditional DNA Barcode classification methods. On empirical data their classification performance is at a comparable level to the other methods.
Conclusions: The classification analysis shows that supervised machine learning methods are promising candidates for successfully handling the DNA Barcoding species classification problem, obtaining excellent performance. To conclude, a powerful tool to perform species identification is now available to the DNA Barcoding community.
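The abstract above mentions converting FASTA sequences to the Weka format. The Python sketch below shows one way such a conversion to ARFF could look; it is not the released software. It assumes aligned sequences of equal length, a hypothetical header layout `>specimen_id|species name`, one nominal attribute per alignment position, and that any character other than A, C, G, or T is recoded as `N`. The function names `parse_fasta` and `fasta_to_arff` are invented for this illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def fasta_to_arff(text, relation="barcodes"):
    """Build an ARFF document with one nominal attribute per aligned site."""
    records = []
    for header, seq in parse_fasta(text):
        # Assumed (hypothetical) header layout: "specimen_id|species name".
        if "|" not in header:
            raise ValueError(f"header lacks a species field: {header!r}")
        species = header.split("|", 1)[1].strip()
        # Recode gaps and ambiguity codes as N to keep a fixed value set.
        bases = [b if b in "ACGT" else "N" for b in seq.upper()]
        records.append((species, bases))

    lengths = {len(bases) for _, bases in records}
    if len(lengths) != 1:
        raise ValueError("expected at least one record, all of the same aligned length")
    n_sites = lengths.pop()

    species_names = sorted({s for s, _ in records})
    lines = [f"@relation {relation}", ""]
    lines += [f"@attribute pos{i + 1} {{A,C,G,T,N}}" for i in range(n_sites)]
    # Species names are single-quoted because they usually contain spaces.
    lines.append("@attribute species {" + ",".join(f"'{s}'" for s in species_names) + "}")
    lines += ["", "@data"]
    lines += [",".join(bases) + f",'{s}'" for s, bases in records]
    return "\n".join(lines) + "\n"


example = ">s1|Homo sapiens\nACGT\n>s2|Pan troglodytes\nACGA\n"
print(fasta_to_arff(example))
```

Placing the species as the last attribute follows the common Weka convention of treating the final attribute as the class. Species names containing apostrophes would need escaping, which this sketch does not handle.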
The identification of early and stage-specific biomarkers for Alzheimer's disease (AD) is critical, as the development of disease-modification therapies may depend on the discovery and validation of such markers. The identification of early reliable biomarkers depends on the development of new diagnostic algorithms to computationally exploit the information in large biological datasets. To identify potential biomarkers from mRNA expression profile data, we used the Logic Mining method for the unbiased analysis of a large microarray expression dataset from the anti-NGF AD11 transgenic mouse model. The gene expression profile of AD11 brain regions was investigated at different neurodegeneration stages by whole genome microarrays. A new implementation of the Logic Mining method was applied both to early (1-3 months) and late stage (6-15 months) expression data, coupled to standard statistical methods. A small number of "fingerprinting" formulas was isolated, encompassing mRNAs whose expression levels were able to discriminate between diseased and control mice. We selected three differential "signature" genes specific for the early stage (Nudt19, Arl16, Aph1b), five common to both groups (Slc15a2, Agpat5, Sox2ot, 2210015D19Rik, Wdfy1), and seven specific for late stage (D14Ertd449, Tia1, Txnl4, 1810014B01Rik, Snhg3, Actl6a, Rnf25). We suggest these genes as potential biomarkers for the early and late stage of AD-like neurodegeneration in this model and conclude that Logic Mining is a powerful and reliable approach for large scale expression data analysis. Its application to large expression datasets from brain or peripheral human samples may facilitate the discovery of early and stage-specific AD biomarkers.
BLOG (Barcoding with LOGic) is a diagnostic and character-based DNA Barcode analysis method. Its aim is to classify specimens to species based on DNA Barcode sequences and on a supervised machine learning approach, using classification rules that compactly characterize species in terms of DNA Barcode locations of key diagnostic nucleotides. The BLOG 2.0 software, its fundamental modules, online/offline user interfaces and recent improvements are described. These improvements affect both methodology and software design, and lead to the availability of different releases on the website http://dmb.iasi.cnr.it/blog-downloads.php. Previous and new experimental tests show that BLOG 2.0 outperforms previous versions as well as other DNA Barcode analysis methods.
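To make the idea of classification rules over diagnostic nucleotide positions concrete, the Python sketch below applies a set of such rules to query sequences. It illustrates rule application only and says nothing about how BLOG learns its rules or how it stores them: the rule representation (a conjunction of 1-based position and nucleotide pairs per species), the function names `matches` and `classify`, and the example rules are all hypothetical.

```python
# Hypothetical rules: each species is characterized by a conjunction of
# (1-based alignment position, nucleotide) conditions.
rules = {
    "Species A": [(3, "A"), (7, "T")],
    "Species B": [(3, "G")],
}


def matches(seq, conditions):
    """True if the sequence carries the required nucleotide at every position."""
    return all(len(seq) >= pos and seq[pos - 1].upper() == nuc
               for pos, nuc in conditions)


def classify(seq, rules):
    """Return the single species whose rule fires, or None if zero or several do."""
    hits = [species for species, conditions in rules.items()
            if matches(seq, conditions)]
    return hits[0] if len(hits) == 1 else None


print(classify("CCACCCTC", rules))  # position 3 is A and position 7 is T
print(classify("CCGCCCTC", rules))  # position 3 is G
print(classify("CCTCCCTC", rules))  # no rule fires
```

Because each rule names explicit positions and nucleotides, a rule that fires also says which sites distinguish the species, which is the kind of species-level information that could be reused in detection assays.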