The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative for all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text was measured using an agreement study between annotators, obtaining a percentage agreement of 91. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/
In a randomized study of acute myelocytic leukemia (AMI). 352 patients of all ages were treated for remission induction by one of the four regimens: 7 days of cytosine arabinoside (ara-C) by continuous intravenous (i.v.) infusion or bolus injection every 1 2 hr. together with daunorubicin (DNR) by rapid i.v. injection on days 1. 2. 3; or 5 days of ara-C by infusion or bolus injection and DNR for 2 days only. The regimen of 7 and 3 infusion was significantly superior to the other 3 regimens. resulting in 56% complete remission (CR). For remission maintenance. ara-C was given for 5 days every month and each month one of the following four drugs added on a cyclic rotational basis: thioguanine. cyclophosphamide. CCNU. or DNR. Although ara-C dosage each month was the same. the route of ara-C
We refined the performance of Cocoa/Peaberry, a linguistically motivated system, on extracting disease entities from clinical notes in the training and development sets for Task 7. Entities were identified in noun chunks by use of dictionaries, and events ('The left atrium is dilated') through our own parser and predicate-argument structures.We also developed a module to map the extracted entities to the SNOMED subset of UMLS. The module is based on direct matching against UMLS entries through regular expressions derived from a small set of morphological transformations, along with priority rules when multiple UMLS entries were matched. The performance on training and development sets was 81.0% and 83.3% respectively (Task A), and the UMLS matching scores were respectively 75.3% and 78.2% (Task B). However, the performance against the test set was low by comparison, 72.0% for Task A and 63.9% for Task B, even while the pure UMLS mapping score was reasonably high (relaxed score in Task B = 91.2%). We speculate that our moderate performance on the test set derives primarily from chunking/parsing errors.
Electronic Health Record (EHR) use in India is generally poor, and structured clinical information is mostly lacking. This work is the first attempt aimed at evaluating unstructured text mining for extracting relevant clinical information from Indian clinical records. We annotated a corpus of 250 discharge summaries from an Intensive Care Unit (ICU) in India, with markups for diseases, procedures, and lab parameters, their attributes, as well as key demographic information and administrative variables such as patient outcomes. In this process, we have constructed guidelines for an annotation scheme useful to clinicians in the Indian context. We evaluated the performance of an NLP engine, Cocoa, on a cohort of these Indian clinical records. We have produced an annotated corpus of roughly 90 thousand words, which to our knowledge is the first tagged clinical corpus from India. Cocoa was evaluated on a test corpus of 50 documents. The overlap F-scores across the major categories, namely disease/symptoms, procedures, laboratory parameters and outcomes, are 0.856, 0.834, 0.961 and 0.872 respectively. These results are competitive with results from recent shared tasks based on US records. The annotated corpus and associated results from the Cocoa engine indicate that unstructured text mining is a viable method for cohort analysis in the Indian clinical context, where structured EHR records are largely absent.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.