We introduce an extensible and modifiable knowledge representation model to represent cancer disease characteristics in a comparable and consistent fashion. We describe a system, MedTAS/P, which automatically instantiates the knowledge representation model from free-text pathology reports. MedTAS/P is based on an open-source framework, and its components use natural language processing principles, machine learning and rules to discover and populate elements of the model. To validate the model and measure the accuracy of MedTAS/P, we developed a gold-standard corpus of manually annotated colon cancer pathology reports. MedTAS/P achieves F1-scores of 0.97-1.0 for instantiating classes in the knowledge representation model such as histologies or anatomical sites, and F1-scores of 0.82-0.93 for primary tumors or lymph nodes, which require the extraction of relations. An F1-score of 0.65 is reported for metastatic tumors, a lower score predominantly due to a very small number of instances in the training and test sets.
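For readers unfamiliar with the metric, the F1-score quoted above is the harmonic mean of precision and recall. The following is a minimal sketch of the standard formula, not code from MedTAS/P; the function name and the example counts are hypothetical and chosen only for illustration.

```python
def f1_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Standard F1: harmonic mean of precision and recall.

    The counts are assumed to come from comparing system output
    against a gold-standard annotation (hypothetical inputs here).
    """
    predicted = true_pos + false_pos   # everything the system extracted
    actual = true_pos + false_neg      # everything in the gold standard
    precision = true_pos / predicted if predicted else 0.0
    recall = true_pos / actual if actual else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical small class: 3 gold instances, 2 found, 1 spurious extraction.
small_class = f1_score(true_pos=2, false_pos=1, false_neg=1)

# Hypothetical large class: 100 gold instances, 99 found, 1 spurious extraction.
large_class = f1_score(true_pos=99, false_pos=1, false_neg=1)
```

With only three gold instances, one miss and one spurious extraction already pull F1 down to about 0.67, whereas the same two errors on a class with 100 instances leave it at 0.99. This is consistent with the abstract's remark that the lower score for metastatic tumors is largely a matter of sample size.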
The aim of this study is to explore the word sense disambiguation (WSD) problem across two biomedical domains: biomedical literature and clinical notes. A supervised machine learning technique was used for the WSD task. One of the challenges addressed is the creation of a suitable clinical corpus with manual sense annotations. This corpus, in conjunction with the WSD set from the National Library of Medicine, provided the basis for the evaluation of our method across multiple domains and for the comparison of our results to published ones. Notably, only 20% of the most relevant ambiguous terms within a domain overlap between the two domains, and those terms have more senses associated with them in the clinical space than in the biomedical literature space. Experimentation with 28 different feature sets rendered a system achieving an average F-score of 0.82 on the clinical data and 0.86 on the biomedical literature.
Accurate and reliable part-of-speech tagging is useful for many Natural Language Processing (NLP) tasks that form the foundation of NLP-based approaches to information retrieval and data mining. In general, large annotated corpora are necessary to achieve the desired part-of-speech tagger accuracy. We show that a large annotated general-English corpus is not sufficient for building a part-of-speech tagger model adequate for tagging documents from the medical domain. However, adding a quite small domain-specific corpus to a large general-English one boosts accuracy from 87% to over 92% in our studies. We also suggest a number of characteristics to quantify the similarities between a training corpus and the test data. These results give guidance for creating an appropriate corpus for building a part-of-speech tagger model that gives satisfactory accuracy on a new domain at a relatively small cost.
Although structured electronic health records are becoming more prevalent, much information about patient health is still recorded only in unstructured text. "Understanding" these texts has been a focus of natural language processing (NLP) research for many years, with some remarkable successes, yet there is more work to be done. Knowing the drugs patients take is critical not only for understanding patient health (e.g., for drug-drug interactions or drug-enzyme interactions), but also for secondary uses, such as research on treatment effectiveness. Several drug dictionaries have been curated, such as RxNorm, FDA's Orange Book, or NCI, with a focus on prescription drugs. Developing these dictionaries is a challenge, but even more challenging is keeping them up-to-date in the face of a rapidly advancing field: it is critical to identify grapefruit as a "drug" for a patient who takes the prescription medicine Lipitor, due to their known adverse interaction. To discover other, new adverse drug interactions, a large number of patient histories often need to be examined, necessitating algorithms that are not only accurate but also fast at identifying pharmacological substances. In this paper we propose a new algorithm, SPOT, which identifies drug names that can be used as new dictionary entries from a large corpus, where a "drug" is defined as a substance intended for use in the diagnosis, cure, mitigation, treatment, or prevention of disease. We present precision and recall values for SPOT, measured against a manually annotated reference corpus. SPOT is language and syntax independent, can be run efficiently to keep dictionaries up-to-date, and can also suggest words and phrases which may be misspellings or uncatalogued synonyms of a known drug. We show how SPOT's lack of reliance on NLP tools makes it robust in analyzing clinical medical text.
SPOT is a generalized bootstrapping algorithm that is seeded with a known dictionary and automatically extracts the context within which each drug is mentioned. We define three features of such context: support, confidence and prevalence. Finally, we present the performance tradeoffs depending on the thresholds chosen for these features.
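The general idea of dictionary-seeded context bootstrapping can be sketched as follows. This is not the SPOT implementation: the abstract does not define support, confidence or prevalence, so the definitions used here are assumptions made purely for illustration, as are the one-word context window, the function names and the toy sentences.

```python
from collections import defaultdict


def extract_contexts(sentences):
    """Map each (left word, right word) context to the words seen inside it."""
    fillers = defaultdict(list)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for i in range(1, len(tokens) - 1):
            fillers[(tokens[i - 1], tokens[i + 1])].append(tokens[i])
    return fillers


def score_contexts(fillers, seeds):
    """Score every context that contains at least one seed drug.

    Assumed definitions (not taken from the paper):
      support    = number of distinct seed drugs seen in the context
      confidence = share of the context's fillers that are seed drugs
      prevalence = share of all token positions covered by the context
    """
    total = sum(len(words) for words in fillers.values())
    scores = {}
    for context, words in fillers.items():
        known = [w for w in words if w in seeds]
        if not known:
            continue
        scores[context] = {
            "support": len(set(known)),
            "confidence": len(known) / len(words),
            "prevalence": len(words) / total,
        }
    return scores


def propose_candidates(fillers, scores, seeds, min_support, min_confidence):
    """Return unknown words found in contexts that pass both thresholds."""
    candidates = set()
    for context, s in scores.items():
        if s["support"] >= min_support and s["confidence"] >= min_confidence:
            candidates.update(w for w in fillers[context] if w not in seeds)
    return candidates


# Toy corpus and seed dictionary (hypothetical data).
sentences = [
    "patient takes aspirin daily",
    "patient takes lipitor daily",
    "patient takes warfarin daily",
    "patient takes walks daily",
]
seeds = {"aspirin", "lipitor"}

fillers = extract_contexts(sentences)
scores = score_contexts(fillers, seeds)
loose = propose_candidates(fillers, scores, seeds, min_support=2, min_confidence=0.5)
strict = propose_candidates(fillers, scores, seeds, min_support=2, min_confidence=0.6)
```

On this toy corpus the loose thresholds propose both "warfarin" and "walks", while the strict confidence threshold proposes nothing, which illustrates the kind of precision/recall tradeoff that threshold choices create. A full bootstrapping loop would add accepted candidates to the seed set and repeat; that step is omitted here.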
Biomedical text plays a fundamental role in knowledge discovery in life science, in both basic research (in the field of bioinformatics) and in industry sectors devoted to improving medical practice, drug development, and health care (such as medical informatics, clinical genomics, and other sectors). Several groups in the IBM Research Division are collaborating on the development of a prototype system for text analysis, search, and text-mining methods to support problem solving in life science. The system is called "BioTeKS" ("Biological Text Knowledge Services"), and it integrates research technologies from multiple IBM Research labs. BioTeKS is also the first major application of the UIMA (Unstructured Information Management Architecture) initiative, which is likewise emerging from IBM Research. BioTeKS is intended to analyze biomedical text such as MEDLINE™ abstracts, medical records, and patents; text is analyzed by automatically identifying terms or names corresponding to key biomedical entities (e.g., "genes," "proteins," "compounds," or "drugs") and concepts or facts related to them. In this paper, we describe the value of text analysis in biomedical research, the development of the BioTeKS system, and applications which demonstrate its functions.
The large-scale sequencing of the human genome has greatly increased our knowledge of the genetic basis of biological processes and accelerated the pace of research and development aimed at treating disease and enhancing the health and well-being of humans. However, these advances also result in increased complexity in understanding and applying biomedical research and data. There is consensus in the life-science (LS) industry and academic laboratories that managing the complexity of biological data and knowledge requires an integrative, information-based systems approach, in which computer technology must play an essential role.
For a cogent analysis of this situation and the role of computational methods in life science, see References 1-3. Key components of computational technology that are relevant to this effort include analyzing, searching, and mining biomedical text, and correlating the structured data derived from texts with data derived from biomedical experiments, transcribed medical records, and so on. This paper describes an IBM Research project to exploit and develop the text-analytical technology needed for managing, analyzing, and using biomedical text to solve problems in life science. We call the system BioTeKS for "Biological Text Knowledge Services." BioTeKS is also one of the first major systems implemented with the IBM Unstructured Information Management Architecture (UIMA), which is described later in this paper, in other papers in this issue [4], and elsewhere [5].