Towards a semantic lexicon for biological language processing

Verspoor, Karin

doi:10.1002/cfg.451

Cited by 8 publications

(5 citation statements)

References 4 publications


“…The low overlap between UMLS and PubMed text has led to a few efforts for enriching controlled vocabularies. Mostly, it has been done by either filtering UMLS terms [21, 27, 29, 34] or reclassifying UMLS concepts [35, 36] for NLP problems. Bodenreider et al. [37], however, suggested an idea of using adjectival modifiers and demodified terms to extend the UMLS Metathesaurus.…”

Section: Introduction (mentioning)

confidence: 99%

Identifying named entities from PubMed® for enriching semantic categories

Kim

Wilbur

2015

BMC Bioinformatics


Background: Controlled vocabularies such as the Unified Medical Language System (UMLS®) and Medical Subject Headings (MeSH®) are widely used for biomedical natural language processing (NLP) tasks. However, the standard terminology in such collections suffers from low usage in biomedical literature, e.g. only 13% of UMLS terms appear in MEDLINE®. Results: We here propose an efficient and effective method for extracting noun phrases for biomedical semantic categories. The proposed approach utilizes simple linguistic patterns to select candidate noun phrases based on headwords, and a machine learning classifier is used to filter out noisy phrases. For experiments, three NLP rules were tested and manually evaluated by three annotators. Our approaches showed over 93% precision on average for the headwords, “gene”, “protein”, “disease”, “cell” and “cells”. Conclusions: Although biomedical terms in knowledge-rich resources may define semantic categories, variations of the controlled terms in literature are still difficult to identify. The method proposed here is an effort to narrow the gap between controlled vocabularies and the entities used in text. Our extraction method cannot completely eliminate manual evaluation, however a simple and automated solution with high precision performance provides a convenient way for enriching semantic categories by incorporating terms obtained from the literature. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-015-0487-2) contains supplementary material, which is available to authorized users.



“…It has been previously noted that terms in structured controlled vocabularies, such as the Gene Ontology (GO) (The Gene Ontology Consortium, 2000), often have a highly regular, even compositional, linguistic structure and that this structure can be exploited for the purposes of accessing those terms computationally and reasoning over them (Mungall, 2004; Ogren et al., 2004; Verspoor, 2005). This regularity is particularly important now that there are efforts to perform intra- or inter-ontology enrichment by linking terms (Bada and Hunter, 2008), because the tools that are used to support these efforts analyze the formal structure of the terms and take advantage of patterns of expression.…”

Section: Introduction (mentioning)

confidence: 99%

Ontology quality assurance through analysis of term transformations

et al. 2009


Motivation: It is important for the quality of biological ontologies that similar concepts be expressed consistently, or univocally. Univocality is relevant for the usability of the ontology for humans, as well as for computational tools that rely on regularity in the structure of terms. However, in practice terms are not always expressed consistently, and we must develop methods for identifying terms that are not univocal so that they can be corrected. Results: We developed an automated transformation-based clustering methodology for detecting terms that use different linguistic conventions for expressing similar semantics. These term sets represent occurrences of univocality violations. Our method was able to identify 67 examples of univocality violations in the Gene Ontology. Availability: The identified univocality violations are available upon request. We are preparing a release of an open source version of the software to be available at http://bionlp.sourceforge.net. Contact: karin.verspoor@ucdenver.edu


“…Two of the major approaches are preparing sharable semantic lexica and preparing sharable semantic grammar rules:

To create a semantic lexicon especially for processing discharge summaries, Johnson [28] proposed associating the Specialist lexemes via Metathesaurus concepts to appropriate UMLS semantic types. Similarly, Verspoor [29] created a semantic lexicon for processing biological literature.

As early as late 60s, Pratt & Pacak [30] already proposed intricate syntacto-semantic grammars that incorporated the semantic classes (Etiology, Function, General, Morphology, and Topography) of the Systematized Nomenclature of Pathology (SNOP) [31], the precursor of the SNOMED [32]. Decades later, Do Amaral Marcio & Satomura [33] picked up the idea again and incorporated SNOMED semantic classes into a syntacto-semantic grammar.…”

Section: Related Work (mentioning)

confidence: 99%

Deriving a probabilistic syntacto-semantic grammar for biomedicine based on domain-specific terminologies

Fan

Friedman

2011

Journal of Biomedical Informatics


Biomedical natural language processing (BioNLP) is a useful technique that unlocks valuable information stored in textual data for practice and/or research. Syntactic parsing is a critical component of BioNLP applications that rely on correctly determining the sentence and phrase structure of free text. In addition to dealing with the vast amount of domain-specific terms, a robust biomedical parser needs to model the semantic grammar to obtain viable syntactic structures. With either a rule-based or corpus-based approach, the grammar engineering process requires substantial time and knowledge from experts, and does not always yield a semantically transferable grammar. To reduce the human effort and to promote semantic transferability, we propose an automated method for deriving a probabilistic grammar based on a training corpus consisting of concept strings and semantic classes from the Unified Medical Language System (UMLS), a comprehensive terminology resource widely used by the community. The grammar is designed to specify noun phrases only due to the nominal nature of the majority of biomedical terminological concepts. Evaluated on manually parsed clinical notes, the derived grammar achieved a recall of 0.644, precision of 0.737, and average cross-bracketing of 0.61, which demonstrated better performance than a control grammar with the semantic information removed. Error analysis revealed shortcomings that could be addressed to improve performance. The results indicated the feasibility of an approach which automatically incorporates terminology semantics in the building of an operational grammar. Although the current performance of the unsupervised solution does not adequately replace manual engineering, we believe once the performance issues are addressed, it could serve as an aide in a semi-supervised solution.

