The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
In this paper we describe a new named entity extraction system. Our system is based on a manually developed set of rules that rely heavily upon some crucial lexical information, linguistic constraints of English, and contextual information. This system achieves state of art results in the protein name detection task, which is what many of the current name extraction systems do. We discuss the need for detection of chemical names and show that we not only obtain a high degree of success in recognizing chemicals but that this task can help improve the precision of protein name detection as well. We use context and surrounding words for categorization of named entities and find the results obtained are encouraging.
A rule-based system, RLIMS-P (Rule-based LIterature Mining System for Protein Phosphorylation), was used to extract protein phosphorylation information from MEDLINE abstracts. An annotation-tagged literature corpus developed at PIR was used to evaluate the system for finding phosphorylation papers and extracting phosphorylation objects (kinases, substrates and sites) from abstracts. RLIMS-P achieved a precision and recall of 91.4 and 96.4% for paper retrieval, and of 97.9 and 88.0% for extraction of substrates and sites. Coupling the high recall for paper retrieval and high precision for information extraction, RLIMS-P facilitates literature mining and database annotation of protein phosphorylation.
Background A significant proportion of children with asthma have delayed diagnosis of asthma by health care providers. Manual chart review according to established criteria is more accurate than directly using diagnosis codes, which tend to under-identify asthmatics, but chart reviews are more costly and less timely. Objective To evaluate the accuracy of a computational approach to asthma ascertainment, characterizing its utility and feasibility toward large-scale deployment in electronic medical records. Methods A natural language processing (NLP) system was developed for extracting predetermined criteria for asthma from unstructured text in electronic medical records and then inferring asthma status based on these criteria. Using manual chart reviews as a gold standard, asthma status (yes vs no) and identification date (first date of a “yes” asthma status) were determined by the NLP system. Results Patients were a group of children (n =112, 84% Caucasian, 49% girls) younger than 4 years (mean 2.0 years, standard deviation 1.03 years) who participated in previous studies. The NLP approach to asthma ascertainment showed sensitivity, specificity, positive predictive value, negative predictive value, and median delay in diagnosis of 84.6%, 96.5%, 88.0%, 95.4%, and 0 months, respectively; this compared favorably with diagnosis codes, at 30.8%, 93.2%, 57.1%, 82.2%, and 2.3 months, respectively. Conclusions Automated asthma ascertainment from electronic medical records using NLP is feasible and more accurate than traditional approaches such as diagnosis codes. Considering the difficulty of labor-intensive manual record review, NLP approaches for asthma ascertainment should be considered for improving clinical care and research, especially in large-scale efforts.
We present an approach that integrates protein structure analysis and text mining for protein functional site prediction, called LEAP-FS (Literature Enhanced Automated Prediction of Functional Sites). The structure analysis was carried out using Dynamics Perturbation Analysis (DPA), which predicts functional sites at control points where interactions greatly perturb protein vibrations. The text mining extracts mentions of residues in the literature, and predicts that residues mentioned are functionally important. We assessed the significance of each of these methods by analyzing their performance in finding known functional sites (specifically, small-molecule binding sites and catalytic sites) in about 100,000 publicly available protein structures. The DPA predictions recapitulated many of the functional site annotations and preferentially recovered binding sites annotated as biologically relevant vs. those annotated as potentially spurious. The text-based predictions were also substantially supported by the functional site annotations: compared to other residues, residues mentioned in text were roughly six times more likely to be found in a functional site. The overlap of predictions with annotations improved when the text-based and structure-based methods agreed. Our analysis also yielded new high-quality predictions of many functional site residues that were not catalogued in the curated data sources we inspected. We conclude that both DPA and text mining independently provide valuable high-throughput protein functional site predictions, and that integrating the two methods using LEAP-FS further improves the quality of these predictions.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.