Sidsel Boldsen scite author profile

In this work we propose a data-driven methodology for identifying temporal trends in a corpus of medieval charters. We used perplexities derived from RNNs as a distance measure between documents and then performed clustering on those distances. We argue that perplexities calculated by such language models are representative of temporal trends. The clusters produced using the K-Means algorithm give insight into differences in language across time periods that are at least partly due to language change. We suggest that the temporal distribution of the individual clusters might provide a more nuanced picture of temporal trends than discrete bins do, and might therefore give better results when used in a classification task.
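The idea of using perplexity as a distance between documents can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: an add-one smoothed character bigram model stands in for the RNN language models, the distance is symmetrized by averaging the two cross-perplexities, and the names `train_bigram`, `perplexity`, and `perplexity_distances` are hypothetical.

```python
import math
from collections import Counter


def train_bigram(text):
    """Count character bigrams and their left-hand contexts (stand-in for an RNN LM)."""
    return Counter(zip(text, text[1:])), Counter(text[:-1])


def perplexity(model, text, vocab_size):
    """Perplexity of `text` under an add-one smoothed bigram model."""
    bigrams, contexts = model
    log_prob, n = 0.0, 0
    for a, b in zip(text, text[1:]):
        p = (bigrams[(a, b)] + 1) / (contexts[a] + vocab_size)
        log_prob += math.log(p)
        n += 1
    return math.exp(-log_prob / n)


def perplexity_distances(docs):
    """Symmetrized cross-perplexity matrix: entry (i, j) averages the perplexity
    of doc j under the model of doc i and of doc i under the model of doc j.
    Assumes every document has at least two characters."""
    vocab_size = len(set("".join(docs)))
    models = [train_bigram(d) for d in docs]
    ppl = [[perplexity(m, d, vocab_size) for d in docs] for m in models]
    k = len(docs)
    return [[(ppl[i][j] + ppl[j][i]) / 2 for j in range(k)] for i in range(k)]


# Toy strings, invented for illustration only.
docs = ["abababab", "abababab", "cdcdcdcd"]
D = perplexity_distances(docs)
```

The diagonal of this matrix is not zero, since a document still has some perplexity under its own model. Each row could then be handed to a clustering routine such as K-Means as a feature vector, which is one plausible reading of "clustering on those distances".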


Letters From the Past: Modeling Historical Sound Change Through Diachronic Character Embeddings

Boldsen¹,

Paggio²

2022

Preprint


While a great deal of work has been done on NLP approaches to lexical semantic change detection, other aspects of language change have received less attention from the NLP community. In this paper, we address the detection of sound change through historical spelling. We propose that a sound change can be captured by comparing the relative distance through time between the distributions of the characters involved before and after the change has taken place. We model these distributions using PPMI character embeddings. We verify this hypothesis in synthetic data and then test the method's ability to trace the well-known historical change of lenition of plosives in Danish historical sources. We show that the models are able to identify several of the changes under consideration and to uncover meaningful contexts in which they appeared. The methodology has the potential to contribute to the study of open questions such as the relative chronology of sound shifts and their geographical distribution.
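A minimal sketch of PPMI character embeddings, and of comparing the distance between two characters in two time periods, might look like the following. It is an illustration under assumptions, not the paper's code: the context window of one character on each side, the use of cosine similarity, the function names `ppmi_char_embeddings` and `cosine`, and the toy word lists are all hypothetical.

```python
import math
from collections import Counter


def ppmi_char_embeddings(words, window=1):
    """Return {char: {context_char: PPMI}} from within-word co-occurrence counts."""
    pair_counts = Counter()
    for w in words:
        for i, c in enumerate(w):
            lo, hi = max(0, i - window), min(len(w), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    pair_counts[(c, w[j])] += 1
    total = sum(pair_counts.values())
    char_counts, ctx_counts = Counter(), Counter()
    for (c, ctx), n in pair_counts.items():
        char_counts[c] += n
        ctx_counts[ctx] += n
    emb = {}
    for (c, ctx), n in pair_counts.items():
        pmi = math.log2(n * total / (char_counts[c] * ctx_counts[ctx]))
        if pmi > 0:  # keep only positive PMI values
            emb.setdefault(c, {})[ctx] = pmi
    return emb


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy spellings invented for illustration; not taken from the Danish sources.
early = ["gata", "mat", "bida", "dag"]
late = ["gada", "mad", "bida", "dag"]

emb_early = ppmi_char_embeddings(early)
emb_late = ppmi_char_embeddings(late)

# Similarity between "t" and "d" measured separately within each period.
sim_early = cosine(emb_early.get("t", {}), emb_early.get("d", {}))
sim_late = cosine(emb_late.get("t", {}), emb_late.get("d", {}))
```

Measuring the similarity of the two characters inside each period, and then comparing those numbers across periods, avoids having to align embedding spaces trained on different data. A character that is absent from a period falls back to an empty vector here, so its similarity is reported as 0.0.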


Letters From the Past: Modeling Historical Sound Change Through Diachronic Character Embeddings

Boldsen¹,

Paggio²

2022


The Hidden Folk: Linguistic Properties Encoded in Multilingual Contextual Character Representations

Agirrezabal¹,

Boldsen²,

Hollenstein³

2023




Interpreting Character Embeddings With Perceptual Representations: The Case of Shape, Sound, and Color

Identifying Temporal Trends Based on Perplexity and Clustering: Are We Looking at Language Change?
