BackgroundManually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text.ResultsThis paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement.ConclusionsAs the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
BackgroundOntological concepts are useful for many different biomedical tasks. Concepts are difficult to recognize in text due to a disconnect between what is captured in an ontology and how the concepts are expressed in text. There are many recognizers for specific ontologies, but a general approach for concept recognition is an open problem.ResultsThree dictionary-based systems (MetaMap, NCBO Annotator, and ConceptMapper) are evaluated on eight biomedical ontologies in the Colorado Richly Annotated Full-Text (CRAFT) Corpus. Over 1,000 parameter combinations are examined, and best-performing parameters for each system-ontology pair are presented.ConclusionsBaselines for concept recognition by three systems on eight biomedical ontologies are established (F-measures range from 0.14–0.83). Out of the three systems we tested, ConceptMapper is generally the best-performing system; it produces the highest F-measure of seven out of eight ontologies. Default parameters are not ideal for most systems on most ontologies; by changing parameters F-measure can be increased by up to 0.4. Not only are best performing parameters presented, but suggestions for choosing the best parameters based on ontology characteristics are presented.
BackgroundWe introduce the linguistic annotation of a corpus of 97 full-text biomedical publications, known as the Colorado Richly Annotated Full Text (CRAFT) corpus. We further assess the performance of existing tools for performing sentence splitting, tokenization, syntactic parsing, and named entity recognition on this corpus.ResultsMany biomedical natural language processing systems demonstrated large differences between their previously published results and their performance on the CRAFT corpus when tested with the publicly available models or rule sets. Trainable systems differed widely with respect to their ability to build high-performing models based on this data.ConclusionsThe finding that some systems were able to train high-performing models based on this corpus is additional evidence, beyond high inter-annotator agreement, that the quality of the CRAFT corpus is high. The overall poor performance of various systems indicates that considerable work needs to be done to enable natural language processing systems to work well when the input is full-text journal articles. The CRAFT corpus provides a valuable resource to the biomedical natural language processing community for evaluation and training of new models for biomedical full text publications.
The Gene Ontology (GO) consists of nearly 30,000 classes for describing the activities and locations of gene products. Manual maintenance of an ontology of this size is a considerable effort, and errors and inconsistencies inevitably arise. Reasoners can be used to assist with ontology development, automatically placing classes in a subsumption hierarchy based on their properties. However, the historic lack of computable definitions within the GO has prevented the user of these tools. In this paper we present preliminary results of an ongoing effort to normalize the GO by explicitly stating the definitions of compositional classes in a form that can be used by reasoners. These definitions are partitioned into mutually exclusive cross-product sets, many of which reference other OBO Foundry candidate ontologies for chemical entities, proteins, biological qualities and anatomical entities. Using these logical definitions we are gradually beginning to automate many aspects of ontology development, detecting errors and filling in missing relationships. These definitions also enhance the GO by weaving it into the fabric of a wider collection of interoperating ontologies, increasing opportunities for data integration and enhancing genomic analyses.
BackgroundThe ability to query many independent biological databases using a common ontology-based semantic model would facilitate deeper integration and more effective utilization of these diverse and rapidly growing resources. Despite ongoing work moving toward shared data formats and linked identifiers, significant problems persist in semantic data integration in order to establish shared identity and shared meaning across heterogeneous biomedical data sources.ResultsWe present five processes for semantic data integration that, when applied collectively, solve seven key problems. These processes include making explicit the differences between biomedical concepts and database records, aggregating sets of identifiers denoting the same biomedical concepts across data sources, and using declaratively represented forward-chaining rules to take information that is variably represented in source databases and integrating it into a consistent biomedical representation. We demonstrate these processes and solutions by presenting KaBOB (the Knowledge Base Of Biomedicine), a knowledge base of semantically integrated data from 18 prominent biomedical databases using common representations grounded in Open Biomedical Ontologies. An instance of KaBOB with data about humans and seven major model organisms can be built using on the order of 500 million RDF triples. All source code for building KaBOB is available under an open-source license.ConclusionsKaBOB is an integrated knowledge base of biomedical data representationally based in prominent, actively maintained Open Biomedical Ontologies, thus enabling queries of the underlying data in terms of biomedical concepts (e.g., genes and gene products, interactions and processes) rather than features of source-specific data schemas or file formats. KaBOB resolves many of the issues that routinely plague biomedical researchers intending to work with data from multiple data sources and provides a platform for ongoing data integration and development and for formal reasoning over a wealth of integrated biomedical data.Electronic supplementary materialThe online version of this article (doi:10.1186/s12859-015-0559-3) contains supplementary material, which is available to authorized users.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.