Large scale biomedical texts classification: a kNN and an ESA-based approaches

Dramé, Khadim; Mougin, Fleur; Diallo, Gayo

doi:10.1186/s13326-016-0073-1

Cited by 22 publications

(11 citation statements)

References 26 publications

“…They reported that the proposed methodology indicated reasonable classification performance. Drame et al [28] proposed k-nearest neighbours (kNN) and an explicit semantic analysis based approach for large-scale biomedical document classification on a subset of MEDLINE documents. They stated that the kNN-based method with the RF learning algorithm achieved good performances compared with the current state-of-the-art methods.…”

Section: Related Work (mentioning)

confidence: 99%

On classification of abstracts obtained from medical journals

Parlak

Uysal

2019

Journal of Information Science

Classification of medical documents was mostly carried out on English data sets and these studies were performed on hospital records rather than academic texts. The main reasons behind this situation are the lack of publicly available data sets and the tasks being costly and time-consuming. As the first contribution of this study, two data sets including Turkish and English counterparts of the same abstracts published in Turkish medical journals were constructed. Turkish is one of the widely used agglutinative languages worldwide and English is a good example of non-agglutinative languages. While English abstracts were obtained automatically from MEDLINE database with a computer program, Turkish counterparts of these documents were collected manually from the Internet. As the second contribution of this study, an extensive comparison on classification of abstracts obtained from Turkish medical journals was made by using these two equivalent data sets. Features were extracted from text documents with three different approaches: unigram, bigram and hybrid. Hybrid approach includes a combination of unigram and bigram features. In the experiments, three different feature selection methods and seven different classifiers were utilised. According to the results on both data sets, classification performance of the English abstracts outperformed the Turkish counterparts. Maximum accuracies were obtained from the combination of unigram features, distinguishing feature selector (DFS) and multinomial naïve Bayes (MNB) classifier for both data sets. Unigram features were generally more efficient than bigram and hybrid features. However, analysis of top-10 features indicated that nearly half of the features were translations of each other for Turkish and English data sets.

“…Recall value is more relevant in this case since it shows the proportion of the correct annotations that an approach was able to discover. The choice of concepts for annotating documents is quite subjective and attaining high recall values remain a challenge [8].…”

Section: Results (mentioning)

confidence: 99%

“…One variant of KNN ranks candidate concepts by combining the relevance scores of documents for which they form annotations [6,7]. Another variant passes the features of candidate concepts to a machine classifier which determines which concepts to put forward for annotation [8]. Some features that are used by a classifier include the proportion of retrieved documents that were annotated with the concept and if the concept appears in the title or content of a document.…”

Section: Related Work (mentioning)

confidence: 99%

“…Some features that are used by a classifier include the proportion of retrieved documents that were annotated with the concept and if the concept appears in the title or content of a document. Experimental results show that KNN or hybrids of it are most effective in recommending annotations [2,8]. However, supervised approaches cannot be used when a corpus of annotated documents does not exist.…”

Section: Related Work (mentioning)

confidence: 99%

Taxonomic Corpus-Based Concept Summary Generation for Document Annotation

Nkisi-Orji

Wiratunga

Hui

et al. 2017

Research and Advanced Technology for Digital Libraries
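
The first kNN variant mentioned in the citation statements above ranks candidate concepts by combining the relevance scores of the retrieved documents they annotate. The following is a minimal sketch of that idea, not the cited authors' implementation: the function name `rank_concepts`, the choice of summing scores, and the toy neighbours are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_concepts(neighbours, top_n=5):
    """Rank candidate concepts from the k retrieved documents.

    neighbours: list of (relevance_score, concepts) pairs, one per retrieved
    document. Summing the scores is an assumed combination rule.
    """
    scores = defaultdict(float)
    for relevance, concepts in neighbours:
        # Count each concept at most once per neighbouring document.
        for concept in set(concepts):
            scores[concept] += relevance
    # Highest combined score first; ties are broken alphabetically.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]


# Hypothetical neighbours: (relevance to the query document, its annotations).
neighbours = [
    (0.9, ["Neoplasms", "Humans"]),
    (0.7, ["Humans", "Mice"]),
    (0.4, ["Neoplasms"]),
]
ranked = rank_concepts(neighbours)
ranked_names = [name for name, _ in ranked]
print(ranked_names)
```

The second variant described in those statements would instead pass per-concept features, such as the proportion of retrieved documents annotated with the concept, to a trained classifier.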

“…Furthermore, WikiRelate [7] was the first work which compute the measures of semantic relatedness using Wikipedia, this approach applied the familiar technique used in semantic relatedness based on wordnet and modified it to be used in Wikipedia, such as path-length measure [8], but in general the results are similar. However, Gabrilovich and Markovitch (2007) [5] propose a new approach with Explicit Semantic Analysis (ESA) that achieve highly accurate results, this method has been extensively studied in many applications [9]. ESA use Wikipedia as a semantic interpreter and builds a weighted inverted vector that maps each term into a list of Wikipedia articles in which it appears, and computes the similarity between vectors generated from two terms or texts.…”

Section: Introduction (mentioning)

confidence: 99%

Towards optimize-ESA for text semantic similarity: A case study of biomedical text

Mrhar

Abik

2020

IJECE

Explicit Semantic Analysis (ESA) is an approach to measuring the semantic relatedness between terms or documents based on their similarity to documents in a reference corpus, usually Wikipedia. ESA has received tremendous attention in natural language processing (NLP) and information retrieval. However, ESA relies on a huge Wikipedia index matrix: interpretation multiplies this large matrix by a term vector to produce a high-dimensional vector. Consequently, ESA is expensive in both the interpretation and similarity steps, and much of the time is lost in unnecessary operations. This paper proposes an enhancement to ESA, called optimize-ESA, that reduces the dimension at the interpretation stage by computing semantic similarity within a specific domain. The experimental results show clearly that our method correlates much better with human judgement than the full ESA approach.
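
To make the interpretation and similarity steps concrete, here is a small sketch of the general ESA idea: each term maps to a weighted vector of concepts, a text is interpreted as the sum of its term vectors, and relatedness is the cosine between two such vectors. It is an illustration under stated assumptions, not the optimize-ESA method: the three "concept articles" are invented stand-ins for Wikipedia, and the TF-IDF weighting and all function names are hypothetical choices.

```python
import math
from collections import Counter, defaultdict


def build_inverted_index(concept_articles):
    """Map each term to {concept: TF-IDF weight} over the concept articles."""
    n = len(concept_articles)
    term_counts = {
        concept: Counter(text.lower().split())
        for concept, text in concept_articles.items()
    }
    # Number of concept articles in which each term occurs.
    doc_freq = Counter(t for counts in term_counts.values() for t in counts)
    index = defaultdict(dict)
    for concept, counts in term_counts.items():
        for term, tf in counts.items():
            index[term][concept] = tf * math.log(n / doc_freq[term])
    return index


def interpret(text, index):
    """Interpretation step: sum the concept vectors of the text's terms."""
    vector = defaultdict(float)
    for term in text.lower().split():
        for concept, weight in index.get(term, {}).items():
            vector[concept] += weight
    return vector


def cosine(u, v):
    """Similarity step: cosine between two sparse concept vectors."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(
        sum(w * w for w in v.values())
    )
    return dot / norm if norm else 0.0


# Toy concept space (an assumption; real ESA uses Wikipedia articles).
articles = {
    "Diabetes": "insulin glucose blood sugar pancreas",
    "Cardiology": "heart blood artery pressure",
    "Oncology": "tumor cancer cell chemotherapy",
}
index = build_inverted_index(articles)
related = cosine(interpret("insulin glucose", index),
                 interpret("pancreas sugar", index))
unrelated = cosine(interpret("insulin glucose", index),
                   interpret("tumor chemotherapy", index))
print(related > unrelated)
```

With a full Wikipedia index the concept vectors have one dimension per article, which is the cost that a domain-restricted concept space is meant to reduce.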
