Proceedings - Natural Language Processing in a Deep Learning World 2019
DOI: 10.26615/978-954-452-056-4_149

Evaluation of vector embedding models in clustering of text documents

Abstract: The paper presents an evaluation of word embedding models for clustering texts in the Polish language. The authors verified six different embedding models, ranging from the widely used word2vec, through fastText with character n-gram embeddings, to the deep learning-based ELMo and BERT. Moreover, four standardisation methods, three distance measures and four clustering methods were evaluated. The analysis was performed on two corpora of Polish texts classified by subject. The Adjusted Mutual Information (AMI) met…
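The abstract describes a pipeline of standardisation, distance measure, clustering and AMI scoring. The snippet below is a minimal sketch of that kind of evaluation, assuming scikit-learn; the `doc_embeddings` array is random placeholder data standing in for vectors from any of the compared models (word2vec, fastText, ELMo, BERT), and the chosen clusterers are illustrative, not the authors' exact configuration.

```python
# Sketch: cluster fixed-length document embeddings and score the partition
# against subject labels with Adjusted Mutual Information (AMI).
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
doc_embeddings = rng.normal(size=(200, 768))   # placeholder document vectors
true_labels = rng.integers(0, 5, size=200)     # subject labels of the corpus

# One of several possible standardisation steps applied before clustering.
X = StandardScaler().fit_transform(doc_embeddings)

clusterers = {
    "k-means": KMeans(n_clusters=5, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=5),
}
for name, clusterer in clusterers.items():
    pred = clusterer.fit_predict(X)
    print(f"{name}: AMI = {adjusted_mutual_info_score(true_labels, pred):.3f}")
```

Swapping in different embeddings, standardisations, distance measures and clusterers in this loop reproduces the shape of the comparison the paper reports.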

Cited by 8 publications (3 citation statements) | References 16 publications
“…This transformation allows for the normalization of data by applying a power transformation that can handle both positive and negative values. The studies by Walkowiak and Gniewkowski (2019) [40] and Bisandu et al (2022) [41] highlighted the effectiveness of the Yeo-Johnson transformation in standardizing data and producing well-organized datasets that are easier to work with. Therefore, this study used Yeo-Johnson transformation to deal with the skewed data.…”
Section: Data Processing, 4.2.1 Yeo-Johnson Transformation
confidence: 99%
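The citing study does not include code; the following is a minimal sketch of the Yeo-Johnson transformation it refers to, assuming scikit-learn's PowerTransformer as the implementation and using toy skewed data for illustration only.

```python
# Sketch: apply the Yeo-Johnson power transformation to skewed data that
# contains both positive and negative values (which Box-Cox cannot handle).
import numpy as np
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

rng = np.random.default_rng(42)
skewed = np.concatenate([rng.exponential(2.0, 500), -rng.exponential(0.5, 100)])
X = skewed.reshape(-1, 1)

pt = PowerTransformer(method="yeo-johnson", standardize=True)
X_transformed = pt.fit_transform(X)

print("skewness before:", skew(X.ravel()))
print("skewness after: ", skew(X_transformed.ravel()))
```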
“…In this work, the process of text data processing begins with tokenization, where the raw text is broken down into smaller segments known as tokens, which may include both individual words and meaningful phrases [40]. This crucial step allows a natural language processing (NLP) system to assign a unique numerical ID to each token, facilitating further analysis.…”
Section: Text Preprocessing Process
confidence: 99%
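As a rough illustration of the token-to-ID step the excerpt describes, the sketch below splits text on word boundaries and assigns each distinct token a numerical ID on the fly. Production NLP pipelines typically use trained (often subword) tokenizers, so this naive split is an assumption for illustration, not the cited work's method.

```python
# Sketch: break raw text into tokens and map each distinct token to a unique ID.
import re

def tokenize(text: str) -> list[str]:
    # Lowercase and keep simple word tokens; purely illustrative.
    return re.findall(r"\w+", text.lower())

corpus = [
    "Word embeddings map tokens to vectors.",
    "Tokenization assigns each token a unique ID.",
]

vocab: dict[str, int] = {}
encoded = []
for doc in corpus:
    ids = [vocab.setdefault(tok, len(vocab)) for tok in tokenize(doc)]
    encoded.append(ids)

print(vocab)    # token -> ID mapping
print(encoded)  # documents as ID sequences
```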
“…Few works related to Transformer embeddings and entity embeddings are devoted to text clustering [5,21]. In [31], several text representations (CBOW, BERT, ELMo, etc.) are compared by applying popular clustering algorithms such as KMeans and SpectralClustering.…”
Section: Introduction
confidence: 99%
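To make that comparison set-up concrete, here is a minimal sketch that runs KMeans and SpectralClustering on two representations of the same labelled corpus and scores each partition with AMI. It assumes scikit-learn, uses an English 20-newsgroups subset as a stand-in for the Polish corpora, and compares TF-IDF against an LSA projection purely for illustration; these are not the representations evaluated in the cited paper.

```python
# Sketch: same clustering algorithms, two text representations, AMI decides
# which representation separates the labelled topics better.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans, SpectralClustering
from sklearn.metrics import adjusted_mutual_info_score

data = fetch_20newsgroups(
    subset="train",
    categories=["sci.space", "rec.sport.hockey", "talk.politics.mideast"],
    remove=("headers", "footers", "quotes"),
)
tfidf = TfidfVectorizer(max_features=5000).fit_transform(data.data)
lsa = TruncatedSVD(n_components=100, random_state=0).fit_transform(tfidf)

representations = {"tf-idf": tfidf, "LSA (100 dims)": lsa}
algorithms = {
    "KMeans": lambda: KMeans(n_clusters=3, n_init=10, random_state=0),
    "SpectralClustering": lambda: SpectralClustering(n_clusters=3, random_state=0),
}

for rep_name, X in representations.items():
    for alg_name, make in algorithms.items():
        pred = make().fit_predict(X)
        ami = adjusted_mutual_info_score(data.target, pred)
        print(f"{rep_name} + {alg_name}: AMI = {ami:.3f}")
```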