Canonical and Surface Morphological Segmentation for Nguni Languages

Moeng, Tumi; Reay, Sheldon; Daniels, Aaron; Buys, Jan

doi:10.1007/978-3-030-95070-5_9

Cited by 5 publications

(6 citation statements)

References 13 publications

“…When widening the search for related literature further, we see that there exists a verb parser and generator (Pretorius et al (2017)), morphological analysers (Pretorius & Bosch (2009), ), morphological segmenters (e.g., Mzamo et al (2019), Moeng et al (2021)), language models (e.g., Myoya et al (2023)), and a Grammatical Framework (GF) grammar…”

Section: Related Work (mentioning)

confidence: 99%

“…Specifically, the morphological analysers and canonical segmenters take a word as input and produce canonical morphemes, hence they need to capture phonological conditioning rules, even if that is in an implicit manner, to be able to reverse them. For instance, Moeng et al (2021)'s models can only generate the canonical morphemes nga-i-zin-konzo when given ngezinkonzo 'by the services' if they model the reversal of the phonological conditioning rule a + i → e. These resources also cannot uncover new rules and additional limitations to this type of work are as follows:…”

Section: Related Work (mentioning)

confidence: 99%
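
The rule reversal described in the statement above can be sketched in a few lines. This is a minimal illustration only: it hard-codes the single rule a + i → e from the quoted example, and the names `RULES`, `surface_form`, and `undo_rule` are hypothetical, not taken from Moeng et al (2021) or from the citing paper.

```python
# Illustrative only: one vowel-coalescence rule, a + i -> e, taken from the
# example in the quoted statement. A real rule set would be much larger.
RULES = {("a", "i"): "e"}

def surface_form(morphemes):
    """Join canonical morphemes, applying the rule at each morpheme boundary."""
    word = morphemes[0]
    for morpheme in morphemes[1:]:
        pair = (word[-1], morpheme[0])
        if pair in RULES:
            # Replace the two boundary vowels with the coalesced vowel.
            word = word[:-1] + RULES[pair] + morpheme[1:]
        else:
            word += morpheme
    return word

def undo_rule(word, position):
    """Reverse the rule at one character position; return both pieces or None."""
    for (left, right), merged in RULES.items():
        if word[position] == merged:
            return word[:position] + left, right + word[position + 1:]
    return None

canonical = ["nga", "i", "zin", "konzo"]
print(surface_form(canonical))      # ngezinkonzo
print(undo_rule("ngezinkonzo", 2))  # ('nga', 'izinkonzo')
```

The sketch is told where to undo the rule; a canonical segmenter has to infer both the position and the rule from the surface word alone, which is the difficulty the statement points to.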

“…Since the languages are underresourced and grammatically complex, prevalent neural text generation architectures are likely to struggle to capture the linguistic phenomena when forming individual words. This is implied by the performance of neural models created to undo phonological conditioning -canonical segmentation in Nguni languages (Moeng et al (2021)). Moeng et al (2021)'s best performing canonical segmentation models have F1 scores around 0.7 [3].…”

Section: Introduction (mentioning)

confidence: 99%

“…This is implied by the performance of neural models created to undo phonological conditioning -canonical segmentation in Nguni languages (Moeng et al (2021)). Moeng et al (2021)'s best performing canonical segmentation models have F1 scores around 0.7 [3]. However, analysis of the code [4] shows that the true performance is likely to be much lower since the quantification of the performance does not take the order and completeness of the morphemes into account.…”

Section: Introduction (mentioning)

confidence: 99%
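
To see why a score that ignores morpheme order can overstate quality, compare a set-based F1 with an exact sequence match. This is a hedged sketch, not the evaluation code the statement refers to: `set_f1` and `exact_match` are hypothetical helpers written for illustration, and the scrambled prediction is made up.

```python
def set_f1(predicted, gold):
    """F1 over the sets of morphemes, ignoring position and repetition."""
    pred_set, gold_set = set(predicted), set(gold)
    overlap = len(pred_set & gold_set)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_set)
    recall = overlap / len(gold_set)
    return 2 * precision * recall / (precision + recall)

def exact_match(predicted, gold):
    """1.0 only when every morpheme is present and in the gold order."""
    return 1.0 if list(predicted) == list(gold) else 0.0

gold = ["nga", "i", "zin", "konzo"]
scrambled = ["zin", "nga", "konzo", "i"]  # right morphemes, wrong order
print(set_f1(scrambled, gold), exact_match(scrambled, gold))  # 1.0 0.0
```

Under these assumptions the scrambled output scores a perfect set-based F1 while being an unusable segmentation, which is the kind of gap the citing authors describe.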

Algorithm for Assisting Grammarians when Extracting Phonological Conditioning Rules for Nguni languages

Mahlaza, Khumalo. 2024. JDHASA.

Text generation models, the core technology that underpins chatbots such as ChatGPT [1], that are created to support morphologically complex African languages require the modelling of subword processes such as phonological conditioning. Since we rely on explicit phonological conditioning rules that are manually identified by grammarians to determine the extent to which such models are able to perform for such languages, there is a need to assist grammarians via computational solutions to increase their coverage of known rules. At present, there are no existing algorithms to extract the rules for such processes and therefore enable the creation of building better text generation models. We present a new algorithm for extracting phonological conditioning rules for Nguni languages. All the rules extracted by the algorithm are valid when the input word and associated morphemes are judged to be valid. The algorithm has the potential to improve the productivity of grammarians and enable the creation of modern text generation technologies that support and promote under-resourced languages.

“…Nevertheless, BPE does not necessarily help in situations where knowing a sensical segmentation of linguistic-like units is important, such as attempting to model the ways in which children acquire language (Goldwater et al, 2009), segmenting free-flowing speech (Kamper et al, 2016;Rasanen and Blandon, 2020), creating linguistic tools for morphologically complex languages (Moeng et al, 2021), or studying the structure of an endangered language with few or no current speakers (Dunbar et al, 2020).…”

Section: Introduction (mentioning)

confidence: 99%

A Masked Segmental Language Model for Unsupervised Natural Language Segmentation

Downey, Xia, Levow et al. 2022. Proceedings of the 19th SIGMORPHON Workshop on Computational Research in Phonetics, Phonology, and Morphology.

We introduce a Masked Segmental Language Model (MSLM) for joint language modeling and unsupervised segmentation. While near-perfect supervised methods have been developed for segmenting human-like linguistic units in resource-rich languages such as Chinese, many of the world's languages are both morphologically complex, and have no large dataset of "gold" segmentations for supervised training. Segmental Language Models offer a unique approach by conducting unsupervised segmentation as the byproduct of a neural language modeling objective. However, current SLMs are limited in their scalability due to their recurrent architecture. We propose a new type of SLM for use in both unsupervised and lightly supervised segmentation tasks. The MSLM is built on a span-masking transformer architecture, harnessing a masked bidirectional modeling context and attention, as well as adding the potential for model scalability. In a series of experiments, our model outperforms the segmentation quality of recurrent SLMs on Chinese, and performs similarly to the recurrent model on English.

Parsing IsiZulu Text Using Grammatical Framework

Marais, Pretorius. 2023. Lecture Notes in Networks and Systems.
