Proceedings - Natural Language Processing in a Deep Learning World 2019
DOI: 10.26615/978-954-452-056-4_149

Evaluation of vector embedding models in clustering of text documents

Abstract: The paper presents an evaluation of word embedding models for clustering texts in the Polish language. The authors verified six different embedding models, ranging from the widely used word2vec, through fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. Moreover, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of Polish texts classified by subject. The Adjusted Mutual Information (AMI) met…
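The abstract describes a pipeline of standardisation, distance measure, clustering and AMI scoring. The snippet below is a minimal sketch of that kind of evaluation, assuming scikit-learn; the `doc_embeddings` array is random placeholder data standing in for vectors from any of the compared models (word2vec, fastText, ELMo, BERT), and the chosen clusterers are illustrative, not the authors' exact configuration.

```python
# Sketch: cluster fixed-length document embeddings and score the partition
# against subject labels with Adjusted Mutual Information (AMI).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(200, 768))   # placeholder document vectors
true_labels = rng.integers(0, 5, size=200)     # subject labels of the corpus

# One of several possible standardisation steps applied before clustering.
X = StandardScaler().fit_transform(doc_embeddings)

clusterers = {
    "k-means": KMeans(n_clusters=5, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=5),
}
for name, clusterer in clusterers.items():
    pred = clusterer.fit_predict(X)
    print(f"{name}: AMI = {adjusted_mutual_info_score(true_labels, pred):.3f}")
```

Swapping in different embeddings, standardisations, distance measures and clusterers in this loop reproduces the shape of the comparison the paper reports.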

Cited by 8 publications (3 citation statements) | References 16 publications
“…This transformation allows for the normalization of data by applying a power transformation that can handle both positive and negative values. The studies by Walkowiak and Gniewkowski (2019) [40] and Bisandu et al (2022) [41] highlighted the effectiveness of the Yeo-Johnson transformation in standardizing data and producing well-organized datasets that are easier to work with. Therefore, this study used Yeo-Johnson transformation to deal with the skewed data.…”
Section: Data Processing, 4.2.1 Yeo-Johnson Transformation
confidence: 99%
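The citing study does not include code; the following is a minimal sketch of the Yeo-Johnson transformation it refers to, assuming scikit-learn's PowerTransformer as the implementation and using toy skewed data for illustration only.

```python
# Sketch: apply the Yeo-Johnson power transformation to skewed data that
# contains both positive and negative values (which Box-Cox cannot handle).
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
skewed = np.concatenate([rng.exponential(2.0, 500), -rng.exponential(0.5, 100)])
X = skewed.reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = pt.fit_transform(X)

print("skewness before:", skew(X.ravel()))
print("skewness after: ", skew(X_transformed.ravel()))
```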
“…In this work, the process of text data processing begins with tokenization, where the raw text is broken down into smaller segments known as tokens, which may include both individual words and meaningful phrases [40]. This crucial step allows a natural language processing (NLP) system to assign a unique numerical ID to each token, facilitating further analysis.…”
Section: Text Preprocessing Process
confidence: 99%
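As a rough illustration of the token-to-ID step the excerpt describes, the sketch below splits text on word boundaries and assigns each distinct token a numerical ID on the fly. Production NLP pipelines typically use trained (often subword) tokenizers, so this naive split is an assumption for illustration, not the cited work's method.

```python
# Sketch: break raw text into tokens and map each distinct token to a unique ID.
import re

def tokenize(text: str) -> list[str]:
    # Lowercase and keep simple word tokens; purely illustrative.
    return re.findall(r"\w+", text.lower())

corpus = [
    "Word embeddings map tokens to vectors.",
    "Tokenization assigns each token a unique ID.",
]

vocab: dict[str, int] = {}
encoded = []
for doc in corpus:
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokenize(doc)]
    encoded.append(ids)

print(vocab)    # token -> ID mapping
print(encoded)  # documents as ID sequences
```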
“…Few works related to Transformer embeddings and entity embeddings are devoted to text clustering [5,21]. In [31], several text representations (CBOW, BERT, ELMo, etc.) are compared by applying popular clustering algorithms such as KMeans and SpectralClustering.…”
Section: Introduction
confidence: 99%
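To make that comparison set-up concrete, here is a minimal sketch that runs KMeans and SpectralClustering on two representations of the same labelled corpus and scores each partition with AMI. It assumes scikit-learn, uses an English 20-newsgroups subset as a stand-in for the Polish corpora, and compares TF-IDF against an LSA projection purely for illustration; these are not the representations evaluated in the cited paper.

```python
# Sketch: same clustering algorithms, two text representations, AMI decides
# which representation separates the labelled topics better.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_mutual_info_score

data = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.sport.hockey", "talk.politics.mideast"],
    remove=("headers", "footers", "quotes"),
)
tfidf = TfidfVectorizer(max_features=5000).fit_transform(data.data)
lsa = TruncatedSVD(n_components=100, random_state=0).fit_transform(tfidf)

representations = {"tf-idf": tfidf, "LSA (100 dims)": lsa}
algorithms = {
    "KMeans": lambda: KMeans(n_clusters=3, n_init=10, random_state=0),
    "SpectralClustering": lambda: SpectralClustering(n_clusters=3, random_state=0),
}

for rep_name, X in representations.items():
    for alg_name, make in algorithms.items():
        pred = make().fit_predict(X)
        ami = adjusted_mutual_info_score(data.target, pred)
        print(f"{rep_name} + {alg_name}: AMI = {ami:.3f}")
```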