2017
DOI: 10.3390/computation5030034
Tensor-Based Semantically-Aware Topic Clustering of Biomedical Documents

Abstract: Biomedicine is a pillar of the collective, scientific effort of human self-discovery, as well as a major source of humanistic data codified primarily in biomedical documents. Despite their rigid structure, maintaining and updating a considerably-sized collection of such documents is a task of overwhelming complexity mandating efficient information retrieval for the purpose of the integration of clustering schemes. The latter should work natively with inherently multidimensional data and higher order i…

Cited by 15 publications (7 citation statements)
References 44 publications
“…After these steps, we are left with 128,359 unique articles published in 10,321 journals by 105,300 distinct first authors in the corpus. Stop-words are then removed from the articles, and the text is tokenized using two SpaCy parsers: a biomedical parser [14] and an English parser [9]. Tokens that are names, contain numerical values, or are special characters are also removed.…”
Section: Data Pre-processing
confidence: 99%
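The preprocessing pipeline this citing paper describes (stop-word removal, tokenization, dropping numeric and special-character tokens) can be sketched in plain Python. The tiny stop-word list below is an illustrative stand-in, not SpaCy's actual list, and the tokenizer is a simple regex rather than the two SpaCy parsers the authors use.

```python
import re

# Tiny illustrative stop-word list (a real pipeline would use SpaCy's).
STOP_WORDS = {"the", "is", "and", "of", "are", "from", "then"}

def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, then drop stop-words,
    tokens containing digits, and single-character tokens."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return [t for t in tokens
            if t not in STOP_WORDS
            and not any(c.isdigit() for c in t)
            and len(t) > 1]

print(preprocess("The stop-words are removed from 128,359 articles."))
# → ['stop', 'words', 'removed', 'articles']
```

Numeric tokens such as "128" and "359" are filtered out entirely, mirroring the removal of "words with numerical values" in the quoted snippet.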
“…Drakopoulos et al built a tensor with the dimensions Term x Keyword x Document which is a generalization of the term-document matrix. They use TF-IDF values as tensor entries, and the clustering is done using k-means [6]. In our work, we perform analysis over a four-dimensional tensor and find that components extracted via factorization can separate the documents, authors, and journals into groups and extract topic keywords.…”
Section: Introduction
confidence: 99%
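The Term x Keyword x Document tensor with TF-IDF entries and k-means clustering described above can be illustrated on a toy corpus. Everything below (the corpus, the TF-IDF weighting details, the deterministic centroid initialisation) is an illustrative assumption, not the authors' exact scheme; documents are clustered after matricising the tensor along the document mode.

```python
import numpy as np

# Toy corpus: each document is a list of (term, keyword) pairs, mimicking
# the Term x Keyword x Document layout (names are illustrative).
docs = [
    [("gene", "bio"), ("gene", "bio"), ("cell", "bio")],
    [("gene", "bio"), ("cell", "bio"), ("cell", "bio")],
    [("graph", "math"), ("tensor", "math"), ("tensor", "math")],
]

terms = sorted({t for d in docs for t, _ in d})
keys = sorted({k for d in docs for _, k in d})

# Raw count tensor of shape (terms, keywords, documents).
T = np.zeros((len(terms), len(keys), len(docs)))
for j, d in enumerate(docs):
    for t, k in d:
        T[terms.index(t), keys.index(k), j] += 1

# TF-IDF weighting: term frequency per document, document frequency
# per (term, keyword) fibre across documents.
tf = T / T.sum(axis=(0, 1), keepdims=True)
df = (T > 0).sum(axis=2, keepdims=True)
idf = np.log(len(docs) / np.maximum(df, 1)) + 1.0
W = tf * idf

# Matricise along the document mode (one row per document) and run a
# minimal 2-means with deterministic init on the first and last documents.
X = W.reshape(-1, len(docs)).T
centroids = X[[0, -1]].copy()
for _ in range(10):
    labels = np.argmin(((X[:, None] - centroids[None]) ** 2).sum(-1), axis=1)
    for c in range(2):
        if (labels == c).any():
            centroids[c] = X[labels == c].mean(axis=0)

print(labels.tolist())  # the two biology documents share a cluster
```

The two documents built from biology (term, keyword) pairs end up in one cluster and the mathematics document in the other, which is the behaviour the quoted statement attributes to clustering TF-IDF tensor entries with k-means.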
“…A semi-supervised clustering algorithm was employed at the document-clustering stage. Drakopoulos et al [4] compared three different document representations for biomedical document clustering. They found that performance decreased as the size of the document set increased.…”
Section: Related Work
confidence: 99%