A Multi-metric Algorithm for Hierarchical Clustering of Same-Length Protein Sequences

Tsarouchis, Sotirios-Filippos; Kotouza, Maria Th.; Psomopoulos, Fotis; Mitkas, Pericles A.

doi:10.1007/978-3-319-92016-0_18

Cited by 2 publications

(1 citation statement)

References 9 publications

“…Since the main algorithm makes use of the frequency of occurrence of the main terms in the documents, we call it Frequency-based Hierarchical Clustering (FBHC). A relevant clustering method that we presented in one of our previous works [36] makes use of frequency matrices to construct an hierarchy of biological sequences.…”

Section: Document Clustering (mentioning)

confidence: 99%

A dockerized framework for hierarchical frequency-based document clustering on cloud computing infrastructures

2020

Self Cite

Scalable big data analysis frameworks are of paramount importance in the modern web society, which is characterized by a huge number of resources, including electronic text documents. Document clustering is an important field in text mining and is commonly used for document organization, browsing, summarization and classification. Hierarchical clustering methods construct a hierarchy structure that, combined with the produced clusters, can be useful in managing documents, thus making the browsing and navigation process easier and quicker, and providing only relevant information to the users' queries by leveraging the structure relationships. Nevertheless, the high computational cost and memory usage of baseline hierarchical clustering algorithms render them inappropriate for the vast number of documents that must be handled daily. In this paper, we propose a new scalable hierarchical clustering framework, which uses the frequency of the topics in the documents to overcome these limitations. Our work consists of a binary tree construction algorithm that creates a hierarchy of the documents using three metrics (Identity, Entropy, Bin Similarity), and a branch breaking algorithm which composes the final clusters by applying thresholds to each branch of the tree. The clustering algorithm is followed by a meta-clustering module which makes use of graph theory to obtain insights in the leaf clusters' connections. The feature vectors representing each document derive from topic modeling. At the implementation level, the clustering method has been dockerized in order to facilitate its deployment on cloud computing infrastructures. Finally, the proposed framework is evaluated on several datasets of varying size and content, achieving significant reduction in both memory consumption and computational time over existing hierarchical clustering algorithms. The experiments also include performance testing on cloud resources using different setups and the results are promising.
