A Little Linguistics Goes a Long Way: Unsupervised Segmentation with Limited Language Specific Guidance

Erdmann, Alexander; Khalifa, Salam; Oudah, Mai; Habash, Nizar; Bouamor, Houda

doi:10.18653/v1/w19-4214

Cited by 6 publications

(4 citation statements)

References 40 publications


“…Their approach paved the way for many computational approaches based on morphologically tagged corpora or lexicons. However, "The complexity of morphological inflection is only a small bit of the larger question of morphological typology" (Cotterell et al 2019: 339), and deriving morphologically tagged corpora from the kind of raw texts we address in this paper is still an open issue, despite recent progress (Erdmann et al 2019; Malouf 2017).…”

Section: [4] (mentioning)

confidence: 99%

Towards robust complexity indices in linguistic typology

Pellegrino

2022


There is high hope that corpus-based approaches to language complexity will contribute to explaining linguistic diversity. Several complexity indices have consequently been proposed to compare different aspects among languages, especially in phonology and morphology. However, their robustness against changes in corpus size and content hasn’t been systematically assessed, thus impeding comparability between studies. Here, we systematically test the robustness of four complexity indices estimated from raw texts and either routinely utilized in crosslinguistic studies (Type-Token Ratio and word-level Entropy) or more recently proposed (Word Information Density and Lexical Diversity). Our results on 47 languages strongly suggest that traditional indices are more prone to fluctuation than the newer ones. Additionally, we confirm with Word Information Density the existence of a cross-linguistic trade-off between word-internal and across-word distributions of information. Finally, we implement a proof of concept suggesting that modern deep-learning language models can improve the comparability across languages with non-parallel datasets.
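As a rough illustration of the two traditional indices named in this abstract, the following minimal sketch computes Type-Token Ratio and word-level entropy from a list of tokens. The function names and the whitespace tokenization in the usage note are assumptions made for this example, not the citing paper's implementation. Word Information Density and Lexical Diversity are not sketched because the abstract does not define them.

```python
import math
from collections import Counter


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    return len(set(tokens)) / len(tokens)


def word_entropy(tokens):
    """Shannon entropy (in bits) of the unigram word distribution."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    counts = Counter(tokens)
    total = len(tokens)
    # H = -sum over types of p(w) * log2 p(w), with p(w) estimated by relative frequency
    return -sum((c / total) * math.log2(c / total) for c in counts.values())
```

For example, `type_token_ratio("a b a b".split())` and `word_entropy("a b a b".split())` can be called on any whitespace-tokenized text. Both values depend on the number of tokens supplied, which is the kind of sensitivity to corpus size the abstract sets out to test.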



“…However, many researchers worked on integrating semantics in the learning of morphology (Schone and Jurafsky 2000; Narasimhan et al 2015), especially with the advances in neural network-based distributional semantics (Narasimhan et al 2015). Most recently, Erdmann et al (2019) presented a linguistically motivated alternative to greedy or other unsupervised methods, requiring only minimal language-specific input with large unannotated corpora. In their evaluations, they consistently outperform competitive unsupervised baselines and approach the performance of state-of-the-art models such as MADAMIRA (Pasha et al 2014) and Farasa (Abdelali et al 2016).…”

Section: Unsupervised Learning Approaches To Morphological Segmentation (mentioning)

confidence: 99%

Unsupervised Arabic dialect segmentation for machine translation

Salloum, Habash

2020

Nat. Lang. Eng.

Self Cite


Resource-limited and morphologically rich languages pose many challenges to natural language processing tasks. Their highly inflected surface forms inflate the vocabulary size and increase sparsity in an already scarce data situation. In this article, we present an unsupervised learning approach to vocabulary reduction through morphological segmentation. We demonstrate its value in the context of machine translation for dialectal Arabic (DA), the primarily spoken, orthographically unstandardized, morphologically rich and yet resource poor variants of Standard Arabic. Our approach exploits the existence of monolingual and parallel data. We show comparable performance to state-of-the-art supervised methods for DA segmentation.


“…For example, if π contains words wxyxz and axx, b π is xx and the exponents are (<w, y, z>) and (<a), respectively. Inspired by unsupervised maximum matching in greedy tokenization (Guo, 1997; Erdmann et al, 2019), we define the following paradigm score function:…”

Section: Clustering Into Paradigms (mentioning)

confidence: 99%

The Paradigm Discovery Problem

Erdmann, Elsner, Wu et al.

2020

Preprint

Self Cite


This work treats the paradigm discovery problem (PDP), the task of learning an inflectional morphological system from unannotated sentences. We formalize the PDP and develop evaluation metrics for judging systems. Using currently available resources, we construct datasets for the task. We also devise a heuristic benchmark for the PDP and report empirical results on five diverse languages. Our benchmark system first makes use of word embeddings and string similarity to cluster forms by cell and by paradigm. Then, we bootstrap a neural transducer on top of the clustered data to predict words to realize the empty paradigm slots. An error analysis of our system suggests clustering by cell across different inflection classes is the most pressing challenge for future work. Our code and data are publicly available.
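The citation statement for this paper mentions maximum matching in greedy tokenization. As background only, here is a minimal sketch of generic forward maximum matching against a known lexicon. It is not the paradigm score function from the citing paper, which the quotation truncates, and it is not the segmentation method of Erdmann et al (2019). The function name `max_match`, the `max_len` parameter, and the toy lexicon are hypothetical, chosen for this example.

```python
def max_match(text, lexicon, max_len=None):
    """Greedy forward maximum matching.

    At each position, take the longest lexicon entry that starts there;
    if nothing matches, emit a single character and move on.
    """
    if max_len is None:
        # Longest lexicon entry bounds how far ahead we need to look.
        max_len = max((len(w) for w in lexicon), default=1)
    pieces = []
    i = 0
    while i < len(text):
        # Try candidate end positions from longest to shortest.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                break
        else:
            j = i + 1  # no lexicon entry starts here: fall back to one character
        pieces.append(text[i:j])
        i = j
    return pieces


# Toy lexicon, invented for illustration.
segments = max_match("unhappiness", {"un", "happi", "ness", "happy"})
```

Because the procedure commits to the longest match at each step and never backtracks, an early long match can block a better overall segmentation.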
