BioNLP 2017
DOI: 10.18653/v1/w17-2334

Evaluating Feature Extraction Methods for Knowledge-based Biomedical Word Sense Disambiguation

Abstract: In this paper, we present an analysis of feature extraction methods via dimensionality reduction for the task of biomedical Word Sense Disambiguation (WSD). We modify the vector representations in the 2-MRD WSD algorithm, and evaluate four dimensionality reduction methods: Word Embeddings using Continuous Bag of Words and Skip-gram, Singular Value Decomposition (SVD), and Principal Component Analysis (PCA). We also evaluate the effects of vector size on the performance of each of these methods. Results are eva…
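
The abstract outlines a vector-space pipeline: represent each candidate sense and the ambiguous word's context as vectors, reduce their dimensionality, and choose the sense most similar to the context. Below is a minimal sketch of that idea, assuming scikit-learn; the toy definitions and context are invented, and it uses first-order count vectors for brevity where the paper's 2-MRD method builds second-order co-occurrence vectors from sense definitions.

```python
# Minimal sketch of a 2-MRD-style knowledge-based WSD step (not the authors' code).
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical sense inventory: definitions would come from a machine-readable
# dictionary (e.g., the UMLS in the biomedical setting).
sense_definitions = {
    "cold_1": "a mild viral infection of the nose and throat",
    "cold_2": "a low temperature condition of the environment",
}
context = "the patient reported a runny nose and sore throat"

# Build one shared vocabulary over the definitions and the context.
# (First-order counts here; the paper uses second-order co-occurrence vectors.)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(list(sense_definitions.values()) + [context])

# Dimensionality reduction; n_components is the "vector size" whose effect the
# paper evaluates. SVD is shown; PCA or CBOW/Skip-gram embeddings would slot in
# the same way as alternative feature extraction methods.
svd = TruncatedSVD(n_components=2, random_state=0)
X_reduced = svd.fit_transform(X)

sense_vecs, context_vec = X_reduced[:-1], X_reduced[-1:]
scores = cosine_similarity(context_vec, sense_vecs)[0]
best = list(sense_definitions)[int(np.argmax(scores))]
print(best)  # likely "cold_1" for this toy context
```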


Cited by 7 publications (7 citation statements)
References 20 publications
“…Unsupervised methods do not require labeled training examples and typically use graph-based clustering techniques [15]. Recently, word embedding models [16] and the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) [17], [18] and its variant BERT models [19], [20], [21], all pre-trained on large corpora, were introduced to conduct unsupervised learning for WSD. For example, Mao and Wah [6] generate semantic relatedness measurements between UMLS concepts to achieve disambiguation by applying the word embedding models and various flavors of BERT.…”
Section: Previous Work
confidence: 99%
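
The strategy this excerpt attributes to Mao and Wah (scoring relatedness between UMLS concepts with embeddings or BERT) can be illustrated with a short sketch. The snippet below is an assumption-laden illustration, not their method: the bert-base-uncased checkpoint, the mean-pooling step, and the toy definitions are all choices made here for brevity.

```python
# Hedged sketch: rank candidate senses by BERT-based relatedness to the context.
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative checkpoint; a biomedical BERT variant would be a natural swap.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states into a single vector."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # shape: (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)

# Invented UMLS-style definitions for two senses of "discharge".
sense_a = embed("fluid released from a wound or body opening")
sense_b = embed("the formal release of a patient from hospital care")
context = embed("the patient was sent home from the ward yesterday")

# Cosine similarity as the relatedness measurement; the higher score wins.
rel_a = torch.cosine_similarity(context, sense_a, dim=0)
rel_b = torch.cosine_similarity(context, sense_b, dim=0)
print("hospital-release sense" if rel_b > rel_a else "bodily-fluid sense")
```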
“…The YTEX suite of algorithms [103,104] extends both MetaMap and cTAKES with a disambiguation module that helps to reduce noise considerably, although [105] found that it often over-filtered correct concepts. There has also been significant research in recent years on developing standalone models for disambiguation, using co-occurrence and feature-based approaches [106][107][108] as well as neural models [37,109]. Medical concept normalization more broadly has also become an increasing research focus [38,15], with significant opportunities for disambiguation research [21].…”
Section: Opportunities For Disambiguation Research Using Semantic Typ…
confidence: 99%
“…While some of the deep learning models directly employ word embeddings for disambiguation (Wu et al., 2015; Antunes and Matos, 2017; Charbonnier and Wartena, 2018; Ciosici et al., 2019), some of them employ deep architectures to encode the context of the acronym (Jin et al., 2019; Li et al., 2019). Moreover, acronym disambiguation has also been modeled as the more general tasks of Word Sense Disambiguation (WSD) (Henry et al., 2017; Tulkens et al., 2016) or Entity Linking (EL) (Cheng and Roth, 2013; Li et al., 2015). While the majority of the prior work studies AD in the medical domain (Okazaki and Ananiadou, 2006; Vo et al., 2016; Wu et al., 2017), some recent work proposes acronym disambiguation in the general (Ciosici et al., 2019), enterprise (Li et al., 2018), or scientific domain (Charbonnier and Wartena, 2018).…”
Section: Related Work
confidence: 99%