Background: Scientific data and research results are being published at an unprecedented rate. Many database curators and researchers utilize data and information from the primary literature to populate databases, form hypotheses, or as the basis for analyses or validation of results. These efforts largely rely on manual literature surveys for collection of these data, and while querying the vast amounts of literature using keywords is enabled by repositories such as PubMed, filtering relevant articles from such query results can be a non-trivial and highly time consuming task. Results: We here present a tool that enables users to perform classification of scientific literature by text mining-based classification of article abstracts. BioReader (Biomedical Research Article Distiller) is trained by uploading article corpora for two training categories-e.g. one positive and one negative for content of interest-as well as one corpus of abstracts to be classified and/or a search string to query PubMed for articles. The corpora are submitted as lists of PubMed IDs and the abstracts are automatically downloaded from PubMed, preprocessed, and the unclassified corpus is classified using the best performing classification algorithm out of ten implemented algorithms. Conclusion: BioReader supports data and information collection by implementing text mining-based classification of primary biomedical literature in a web interface, thus enabling curators and researchers to take advantage of the vast amounts of data and information in the published literature. BioReader outperforms existing tools with similar functionalities and expands the features used for mining literature in database curation efforts.
BackgroundThe Immune Epitope Database (IEDB) project manually curates information from published journal articles that describe immune epitopes derived from a wide variety of organisms and associated with different diseases. In the past, abstracts of scientific articles were retrieved by broad keyword queries of PubMed, and were classified as relevant (curatable) or irrelevant (not curatable) to the scope of the database by a Naïve Bayes classifier. The curatable abstracts were subsequently manually classified into categories corresponding to different disease domains. Over the past four years, we have examined how to further improve this approach in order to enhance classification performance and to reduce the need for manual intervention.ResultsUtilizing 89,884 abstracts classified by a domain expert as curatable or uncuratable, we found that a SVM classifier outperformed the previously used Naïve Bayes classifier for curatability predictions with an AUC of 0.899 and 0.854, respectively. Next, using a non-hierarchical and a hierarchical application of SVM classifiers trained on 22,833 curatable abstracts manually classified into three levels of disease specific categories we demonstrated that a hierarchical application of SVM classifiers outperformed non-hierarchical SVM classifiers for categorization. Finally, to optimize the hierarchical SVM classifiers' error profile for the curation process, cost sensitivity functions were developed to avoid serious misclassifications. We tested our design on a benchmark dataset of 1,388 references and achieved an overall category prediction accuracy of 94.4%, 93.9%, and 82.1% at the three levels of categorization, respectively.ConclusionsA hierarchical application of SVM algorithms with cost sensitive output weighting enabled high quality reference classification with few serious misclassifications. This enabled us to significantly reduce the manual component of abstract categorization. Our findings are relevant to other databases that are developing their own document classifier schema and the datasets we make available provide large scale real-life benchmark sets for method developers.
The recent increase in whooping cough in vaccinated populations has been attributed to waning immunity associated with the acellular vaccine. The Immune Epitope Database (IEDB) is a repository of immune epitope data from the published literature and includes T cell and antibody epitopes for human pathogens. The IEDB conducted a review of the epitope literature, which revealed 300 Bordetella pertussis-related epitopes from 39 references. Epitope data are currently available for six virulence factors of B. pertussis: pertussis toxin, pertactin, fimbrial 2, fimbrial 3, adenylate cyclase and filamentous hemagglutinin. The majority of epitopes were defined for antibody reactivity; fewer T cell determinants were reported. Analysis of available protective correlates data revealed a number of candidate epitopes; however few are defined in humans and few have been shown to be protective. Moreover, there are a limited number of studies defining epitopes from natural infection versus whole cell or acellular/subunit vaccines. The relationship between epitope location and structural features, as well as antigenic drift (SNP analysis) was also investigated. We conclude that the cumulative data is yet insufficient to address many fundamental questions related to vaccine failure and this underscores the need for further investigation of B. pertussis immunity at the molecular level.
BackgroundMHC molecules are a highly diverse family of proteins that play a key role in cellular immune recognition. Over time, different techniques and terminologies have been developed to identify the specific type(s) of MHC molecule involved in a specific immune recognition context. No consistent nomenclature exists across different vertebrate species.PurposeTo correctly represent MHC related data in The Immune Epitope Database (IEDB), we built upon a previously established MHC ontology and created an ontology to represent MHC molecules as they relate to immunological experiments.DescriptionThis ontology models MHC protein chains from 16 species, deals with different approaches used to identify MHC, such as direct sequencing verses serotyping, relates engineered MHC molecules to naturally occurring ones, connects genetic loci, alleles, protein chains and multi-chain proteins, and establishes evidence codes for MHC restriction. Where available, this work is based on existing ontologies from the OBO foundry.ConclusionsOverall, representing MHC molecules provides a challenging and practically important test case for ontology building, and could serve as an example of how to integrate other ontology building efforts into web resources.Electronic supplementary materialThe online version of this article (doi:10.1186/s13326-016-0045-5) contains supplementary material, which is available to authorized users.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.