Labelling Hierarchical Clusters of Scientific Articles

Peganova, Irina; Rebrova, A. G.; Nedumov, Yaroslav

doi:10.1109/ivmem.2019.00010

Cited by 4 publications

(4 citation statements)

References 9 publications


“…Yet, they lack the rich and necessary information regarding the data and are very inefficient (Vahidnia, Abbasi, & Abbass, 2020). However, in addition to their usage in text representation (Zhang, Li et al, 2017), it is also very common to use TF-IDF for keyword extraction, as it has been incorporated in literature for similar purposes (Awan & Beg, 2020; Peganova, Rebrova, & Nedumov, 2019; Radu et al, 2020).…”

Section: Journal of Data and Information Science (mentioning)

confidence: 99%

“…ComboBasic (Astrakhantsev, 2015). HCBASIC (Peganova et al, 2019) is an adaptation of ComboBasic, also taking the hierarchical structure of clusters into account. Rapid Automatic Keyword Extraction (RAKE) (Rose et al, 2010) is another key-phrase extraction method, which has been utilized successfully in numerous studies for similar purposes, such as a study by Krenn and Zeilinger (Krenn & Zeilinger, 2020), for "concept extraction".…”

Section: Labeling of the Clusters (mentioning)

confidence: 99%


Embedding-based Detection and Extraction of Research Topics from Academic Documents Using Deep Clustering

Vahidnia

Abbasi

Abbass

2021

Journal of Data and Information Science


Purpose: Detecting research fields or topics and understanding their dynamics helps the scientific community make decisions about establishing scientific fields. It also supports better collaboration with governments and businesses. This study investigates the development of research fields over time by framing it as a topic detection problem.

Design/methodology/approach: To achieve these objectives, we propose a modified deep clustering method to detect research trends from the abstracts and titles of academic documents. Document embedding approaches are used to transform documents into vector-based representations. The proposed method is evaluated against a benchmark dataset by comparing it with combinations of different embedding and clustering approaches and with classical topic modeling algorithms (i.e., LDA). A case study explores the evolution of Artificial Intelligence (AI) by detecting the research topics or sub-fields in AI-related publications.

Findings: Evaluation with clustering performance indicators shows that the proposed method outperforms similar approaches on the benchmark dataset. Using the proposed method, we also show how the topics have evolved over the past 30 years, using a keyword extraction method for cluster tagging and labeling to convey the context of the topics.

Research limitations: We noticed that it is not possible to generalize one solution to all downstream tasks. Hence, the solutions must be fine-tuned or optimized for each task and even each dataset. In addition, the interpretation of cluster labels can be subjective and vary with readers’ opinions. It is also very difficult to evaluate the labeling techniques, which further limits the explanation of the clusters.

Practical implications: The case study shows how, in a real-world example, the proposed method enables researchers and reviewers of academic research to detect, summarize, analyze, and visualize research topics from decades of academic documents. This helps the scientific community and all related organizations analyze the fields quickly and effectively by establishing and explaining the topics.

Originality/value: In this study, we introduce a modified and tuned deep embedding clustering coupled with Doc2Vec representations for topic extraction. We also use a concept extraction method as a labeling approach. The effectiveness of the method has been evaluated in a case study of AI publications, where we analyze the AI topics of the past three decades.


“…In literature, many works have been provided about this topic, for instance in [19] authors provide a method for labelling hierarchical clusters of scientific articles; differently from our approach authors define a label for a cluster of documents and not for a single paper. Furthermore, the label is defined as a set of terms extracted from the cluster and not as a complete sentence.…”

Section: Scientific Papers Information Management (mentioning)

confidence: 99%

SPUCL (Scientific Publication Classifier): A Human-Readable Labelling System for Scientific Publications

Scarpato

Pieroni

Montorsi

2021

Applied Sciences


Critically assessing the scientific literature is a very challenging task; in general, it requires analysing and classifying a large number of documents to define the state of the art of a research field. Document classifier systems have tried to address this problem with different techniques, such as probabilistic, machine learning, and neural network models. One of the most popular document classification approaches is LDA (Latent Dirichlet Allocation), a probabilistic topic model. One of the main issues with the LDA approach is that each retrieved topic is a collection of terms with their probabilities, which is not a human-readable form. This paper defines an approach that makes LDA topics comprehensible to humans by exploiting the Word2Vec approach.


“…Works [16][13] propose dedicated methods for extracting keywords in hierarchical document clustering. The proposed methods can improve the quality of keyword extraction, but since the keywords are unknown in most cases, evaluating their performance requires additional expert assessment. 3.…”

unclassified

Hierarchical Rubrication of Text Documents

Сорокин

Нужный

Савельева

2020

Proceedings of ISP RAS


Topic modeling is an important and widely used method for analyzing large collections of documents. It allows us to digest the content of documents by examining the selected topics. It has drawbacks, such as the need to specify the number of topics: depending on that number, the topics can become too local or too global. It also does not provide a relation between local and global topics. Here we present an algorithm and a computer program for the hierarchical rubrication of text documents. The program solves these problems by creating a hierarchy of automatically selected topics in which local topics are connected to the global topics. The program processes PDF documents, splits them into text segments, builds vector representations using a word2vec model, and stores them in a database. The vector embeddings are structured as a hierarchy of automatically constructed categories. Each category is identified by automatically selected keywords. The result is visualized as an interactive map, and the hierarchy of topics is traversed by zooming the map. An analysis of the constructed hierarchy of categories allows us to evaluate the minimum and maximum depth of the hierarchy, corresponding to the minimum and maximum number of different topics contained in the collection of documents. The program was tested on documents on deep nuclear waste disposal.
