2022
DOI: 10.1007/978-3-030-88389-8_16
Text Representations and Word Embeddings

Cited by 19 publications (13 citation statements)
References 43 publications
“…Top2Vec (Angelov, 2020) is a comparatively new algorithm that uses word embeddings. That is, the vectorization of text data makes it possible to locate semantically similar words, sentences, or documents within spatial proximity (Egger, 2022a). For example, words like "mom" and "dad" should be closer than words like "mom" and "apple."…”
Section: Model 3: Top2Vec
Citation type: mentioning (confidence: 99%)
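To make the spatial-proximity idea in the excerpt above concrete, the following minimal Python sketch compares cosine similarities between toy word vectors. The 4-dimensional values are invented purely for illustration; they are not output of Top2Vec or any real embedding model, which would typically use hundreds of dimensions.

import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine of the angle between two embedding vectors (1.0 = same direction).
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented toy vectors standing in for learned word embeddings.
toy_embeddings = {
    "mom":   np.array([0.9, 0.8, 0.1, 0.0]),
    "dad":   np.array([0.8, 0.9, 0.2, 0.1]),
    "apple": np.array([0.1, 0.0, 0.9, 0.8]),
}

print(cosine_similarity(toy_embeddings["mom"], toy_embeddings["dad"]))    # high: semantically close
print(cosine_similarity(toy_embeddings["mom"], toy_embeddings["apple"]))  # low: semantically distant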
“…Similar to embedding-based topic modeling approaches (Egger & Yu, 2022), topological data analysis involves representing lists of topics in a vector space. For this purpose, the preprocessed text is converted into numerical representations (Egger, 2022b, 2022c).…”
Section: Methods
Citation type: mentioning (confidence: 99%)
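As an illustration of the vectorization step described in the excerpt, the sketch below converts two preprocessed example documents into numerical representations with a plain TF-IDF vectorizer from scikit-learn. The cited works may instead use neural embeddings, so this is only an assumed stand-in for the general text-to-vector conversion.

from sklearn.feature_extraction.text import TfidfVectorizer

# Two already-preprocessed example documents, made up for this sketch.
documents = [
    "word embeddings place similar words close together",
    "topic models group documents by shared themes",
]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(documents)  # sparse matrix: one row per document
print(doc_vectors.shape)  # (2, vocabulary_size)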
“…As Camastra and Vinciarelli point out [35], using more features than is strictly necessary leads to several problems, one of the main ones being the space needed to store the data. As the amount of available information increases, compression for storage becomes even more critical [12, 36, 37]. Additionally, for the scope of this work, it cannot be ignored that applying dimensionality-reduction techniques to pre-computed embeddings improves neither the runtime nor the memory requirements of running the models.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
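A hedged sketch of the point made in that excerpt: reducing the dimensionality of pre-computed embeddings (here with PCA, chosen as an assumed example technique; the random data and target dimension are also assumptions) shrinks the stored vectors, but it cannot speed up the model that produced them, because the embeddings are computed at full size first.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 300))  # stand-in for 1000 pre-computed 300-d vectors

pca = PCA(n_components=50)                 # target dimensionality is an arbitrary choice
reduced = pca.fit_transform(embeddings)

print(embeddings.nbytes, reduced.nbytes)   # storage shrinks roughly by the 300/50 ratio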
“…To the best of our knowledge, dimensionality-reduction research on embeddings in the literature has focused on statistical methods, such as Bag of Words and Term Frequency-Inverse Document Frequency (TF-IDF) [27, 37], and on classical pre-computed word embeddings, including the popular GloVe or FastText embeddings [21–24, 36, 49]. These classical word embeddings are more complex and powerful than statistical methods.…”
Section: Related Work
Citation type: mentioning (confidence: 99%)
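The contrast drawn in that excerpt can be sketched as follows: a statistical Bag-of-Words representation has one dimension per vocabulary word and is sparse, while a classical pre-computed embedding such as GloVe is dense with a fixed, much smaller dimensionality. The example sentences are invented, and the gensim model name and download step are assumptions for illustration only.

from sklearn.feature_extraction.text import CountVectorizer
import gensim.downloader as api

texts = ["the cat sat on the mat", "the dog chased the cat"]

bow = CountVectorizer().fit_transform(texts)
print("Bag-of-Words shape:", bow.shape)               # (2, vocabulary_size), sparse counts

glove = api.load("glove-wiki-gigaword-50")            # downloads a pretrained model on first use
print("GloVe vector for 'cat':", glove["cat"].shape)  # (50,), dense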