In search of isoglosses: continuous and discrete language embeddings in Slavic historical phonology

Cathcart, Chundra; Wandl, Florian

doi:10.18653/v1/2020.sigmorphon-1.28

Cited by 3 publications

(7 citation statements)

References 39 publications


“…A possible explanation is that by including additional embeddings in our models designed to capture different patterns of sound change in different morphological, semantic and etymological scenarios, we have filtered out critical information relevant to subgrouping, removing valuable genetic signal displayed by morphological traits, which may explain why the model with language embeddings outperforms the other models. A similar negative relationship between model accuracy and genetic signal displayed by embeddings was found by Cathcart and Wandl (2020).…”

Section: Genetic Signal (supporting)

confidence: 72%

“…It is not always straightforward to interpret the sources of differentiation among these embeddings; typically, embeddings based on synchronic patterns of language use in corpora may be due to word order patterns, phonotactic patterns, or a number of other interrelated language-specific distributions. Cathcart and Wandl (2020) investigate the patterns of sound change captured by a neural encoder-decoder architecture trained on Proto-Slavic and contemporary Slavic word forms, and find that embeddings display at least partial genetic signal, but also note a negative relationship between overall model accuracy and the degree to which embeddings reflect the communis opinio subgrouping of Slavic.…”

Section: Related Work (mentioning)

confidence: 99%


Disentangling dialects: a neural approach to Indo-Aryan historical phonology and subgrouping

Cathcart, Rama

2020

Proceedings of the 24th Conference on Computational Natural Language Learning

Self Cite


This paper seeks to uncover patterns of sound change across Indo-Aryan languages using an LSTM encoder-decoder architecture. We augment our models with embeddings representing language ID, part of speech, and other features such as word embeddings. We find that a highly augmented model shows highest accuracy in predicting held-out forms, and investigate other properties of interest learned by our models' representations. We outline extensions to this architecture that can better capture variation in Indo-Aryan sound change.


“…In this paper, we investigate the usefulness of word prediction as an intermediate task that may allow us to arrive at computational methods in historical linguistics. The use of word prediction in historical linguistics was first proposed in the first author's master's thesis (Dekker 2018) and independently by Ciobanu and Dinu (2018), followed by recent approaches (List 2019a; Meloni et al 2019; Cathcart and Wandl 2020; Cathcart and Rama 2020; Fourrier and Sagot 2020a). Word prediction is a methodology that enables the use of surface word forms as data (like phenotypic methods), while still capturing the genetic signal through sound correspondences (like genotypic methods), thus allowing for reliable reconstructions of language relationship based on large amounts of data.…”

Section: Word Prediction (mentioning)

confidence: 99%

“…Multiple factors which could lead to an effective use of prediction methods in historical linguistics were evaluated: the choice of machine learning model and encoding of the input data. We evaluated existing models of word prediction (Ciobanu and Dinu 2018; Meloni et al 2019; Cathcart and Wandl 2020; Fourrier and Sagot 2020a) and came up with our own model, which enables applications on several tasks in historical linguistics. In this paper, we have proposed new approaches for phylogenetic tree reconstruction and cognate detection, based on word prediction error.…”

Section: Contribution (mentioning)

confidence: 99%

Word prediction in computational historical linguistics

Dekker, Zuidema

2021

JLM


In this paper, we investigate how the prediction paradigm from machine learning and Natural Language Processing (NLP) can be put to use in computational historical linguistics. We propose word prediction as an intermediate task, where the forms of unseen words in some target language are predicted from the forms of the corresponding words in a source language. Word prediction allows us to develop algorithms for phylogenetic tree reconstruction, sound correspondence identification and cognate detection, in ways close to attested methods for linguistic reconstruction. We will discuss different factors, such as data representation and the choice of machine learning model, that have to be taken into account when applying prediction methods in historical linguistics. We present our own implementations and evaluate them on different tasks in historical linguistics.


“…The most relevant analysis to ours is the recent work by Cathcart and Wandl (2020), in which the authors have trained a neural sequence-to-sequence model on a Slavic etymological dictionary. Their model was trained to consume a reconstructed Proto-Slavic word form and a language embedding, then emit a word form in the modern language specified by the language embedding.…”

Section: Language Representations In Continuous Vector Spaces (mentioning)

confidence: 99%

Rediscovering the Slavic Continuum in Representations Emerging from Neural Models of Spoken Language Identification

Abdullah, Kudera, Avgustinova et al.

2020

Preprint


Deep neural networks have been employed for various spoken language recognition tasks, including tasks that are multilingual by definition such as spoken language identification. In this paper, we present a neural model for Slavic language identification in speech signals and analyze its emergent representations to investigate whether they reflect objective measures of language relatedness and/or non-linguists' perception of language similarity. While our analysis shows that the language representation space indeed captures language relatedness to a great extent, we find perceptual confusability between languages in our study to be the best predictor of the language representation similarity.
