2021
DOI: 10.1162/tacl_a_00365

Morphology Matters: A Multilingual Language Modeling Analysis

Abstract: Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced…

Cited by 11 publications (8 citation statements)
References 17 publications (42 reference statements)
“…This suggests that the tokenization choice could act as an inductive bias for seq2seq models, and character-level framing could be useful even for tasks that are not truly character-level. This observation also aligns with the findings of recent work on language modeling complexity (Park et al., 2021; Mielke et al., 2019). For many languages, including several Slavic ones related to the Serbian-Bosnian pair, a character-level language model yields lower surprisal than one trained on BPE units, suggesting that the effect might also be explained by character tokenization making the language easier to model.…”
Section: Results and Analysis (supporting)
confidence: 89%
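
As background for the surprisal comparison in this excerpt: models with different token inventories are only comparable when total negative log-likelihood is normalized by a tokenization-independent unit such as characters, which is the approach taken in this line of work. Below is a minimal sketch of that normalization; the `bits_per_char` helper and all numbers are hypothetical stand-ins for real model outputs, not values from the cited papers.

```python
import math

def bits_per_char(token_logprobs, num_chars):
    """Total NLL (natural log) converted to bits and normalized by the
    number of characters in the underlying text, so character-level and
    BPE-level models can be compared on the same scale."""
    total_nll_nats = -sum(token_logprobs)
    return total_nll_nats / math.log(2) / num_chars

text = "primjer rečenice"                    # same underlying text for both models
char_lm_logprobs = [-1.2] * len(text)        # hypothetical: one log-prob per character
bpe_lm_logprobs = [-4.8, -5.1, -6.0, -4.4]   # hypothetical: one log-prob per BPE unit

print(bits_per_char(char_lm_logprobs, len(text)))  # ~1.73 bits/char
print(bits_per_char(bpe_lm_logprobs, len(text)))   # ~1.83 bits/char
```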
“…The sentencepiece library was used in our research. Next, the text is segmented into variable-length subword units using unigram language modeling (Park et al., 2021). The subword units are then sorted by their frequency of occurrence in the corpus, and a predefined number of units are selected to form the final vocabulary.…”
Section: Methods (mentioning)
confidence: 99%
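
A minimal sketch of the pipeline this excerpt describes, using the real sentencepiece Python API; the file names, vocabulary size, and sample sentence are illustrative assumptions, not taken from the citing paper.

```python
import sentencepiece as spm

# Train a unigram-LM segmentation model on a raw text corpus
# ("corpus.txt" is an assumed path; vocab_size is illustrative).
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="subword",   # writes subword.model and subword.vocab
    vocab_size=8000,          # the predefined size of the final vocabulary
    model_type="unigram",     # unigram language-model segmentation
)

# Segment text into variable-length subword units with the trained model.
sp = spm.SentencePieceProcessor(model_file="subword.model")
print(sp.encode("Morphology matters for language modeling.", out_type=str))
```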
“…The LM establishes the statistical relationship between words in the language. Word-level statistical modeling of morphologically complex languages cannot achieve the word-sequence prediction capability attainable for languages with simple morphology [8,9]. Additionally, the finite-sized word vocabulary of a pronunciation lexicon does not cover the complex word forms and loan words that appear in real-world settings.…”
Section: Deep Neural Network-Hidden Markov Model (DNN-HMM)-based ASR... (mentioning)
confidence: 99%
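
The vocabulary-coverage problem this excerpt raises can be quantified as an out-of-vocabulary (OOV) rate. A minimal sketch, assuming hypothetical toy data; the `oov_rate` helper is illustrative and not part of the cited ASR system.

```python
from collections import Counter

def oov_rate(train_tokens, test_tokens, vocab_size):
    """Fraction of test tokens missing from the top-`vocab_size` most
    frequent training words: a proxy for how poorly a finite word-level
    lexicon covers a morphologically rich language."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    return sum(t not in vocab for t in test_tokens) / len(test_tokens)

# Toy data: inflected forms ("cats", "mats") fall outside the word vocabulary.
train = "the cat sat on the mat the cat ran".split()
test = "the cats sat on mats".split()
print(oov_rate(train, test, vocab_size=5))  # 0.4
```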
“…A syllabification algorithm tailored for the Malayalam script using finite state transducers has been proposed in [34]. The linguistic rules for syllable tokenization described in Table 6 have been computationally implemented as Algorithm 1 and made available in the Mlphon Python library. The algorithm analyzes the input text sequence and determines whether it falls into one of the four allowable categories of syllable structures in Malayalam.…”
Section: Syllable Tokens (mentioning)
confidence: 99%
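
To illustrate the general idea of template-based syllable tokenization, here is a hypothetical regex sketch over romanized text that greedily matches onset-nucleus-coda chunks. This is not Mlphon's actual implementation, which uses finite state transducers over Malayalam script.

```python
import re

# Hypothetical (C*)V+(C+) template over romanized text; a coda consonant
# cluster is consumed only when it is not followed by a vowel (i.e.,
# word-finally or before the next syllable's onset).
SYLLABLE = re.compile(r"[^aeiou]*[aeiou]+(?:[^aeiou]+(?![aeiou]))?")

def syllabify(word: str) -> list[str]:
    """Greedily split a word into syllable-shaped chunks."""
    return SYLLABLE.findall(word.lower())

print(syllabify("malayalam"))  # ['ma', 'la', 'ya', 'lam']
```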