2007
DOI: 10.1007/978-3-540-74628-7_24

Automatic Diacritic Restoration for Resource-Scarce Languages

Abstract: The orthography of many resource-scarce languages includes diacritically marked characters. Falling outside the scope of the standard Latin encoding, these characters are often represented in digital language resources as their unmarked equivalents. This renders corpus compilation more difficult, as these languages typically do not have the benefit of large electronic dictionaries to perform diacritic restoration. This paper describes experiments with a machine learning approach that is able to autom…

Cited by 30 publications (45 citation statements)
References 6 publications
“…Examples of applications in the morpho-phonological and speech areas are hyphenation and syllabification (Daelemans and Van den Bosch, 1992); classifying phonemes in speech (Kocsor et al., 2000); assignment of word stress (Daelemans, Gillis, and Durieux, 1994); grapheme-to-phoneme conversion (Daelemans and Van den Bosch, 1996; Canisius, Van den Bosch, and Daelemans, 2006); diminutive formation (Daelemans et al., 1998); and morphological analysis (Van den Bosch, Daelemans, and Weijters, 1996; Van den Bosch and Daelemans, 1999; Canisius, Van den Bosch, and Daelemans, 2006). Although these examples are applied mostly to Germanic languages (English, Dutch, and German), applications to other languages with more complicated writing systems or morphologies, or with limited resources, have also been presented: for example, letter-phoneme conversion in Scottish Gaelic (Wolters and Van den Bosch, 1997), morphological analysis of Arabic (Marsi, Van den Bosch, and Soudi, 2006), or diacritic restoration in languages with a diacritic-rich writing system (Mihalcea, 2002; De Pauw, Waiganjo, and De Schryver, 2007). …”
Section: NLP Applications of TiMBL
confidence: 99%
“…The authors in [6] stated that character-level diacritic restoration is premised on the hypothesis that "the local graphemic context encodes sufficient information to solve the disambiguation problem" of diacritic restoration. Such character-level approaches are much simpler, faster, and easier to implement, and do not require language-specific resources [17].…”
Section: Letter-Level Restoration
confidence: 99%
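The hypothesis quoted in this snippet lends itself to a small illustration: every unmarked character becomes one classification instance, described only by its surrounding characters, with the correctly diacritized character as the class label. The window size of 3, the "_" padding symbol, and the example word below are assumptions made for this sketch, not details taken from the cited work.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    """Replace diacritically marked characters with their unmarked equivalents."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def windowed_instances(marked: str, window: int = 3):
    """Yield (context, label): the unmarked local graphemic context and the
    correct, possibly diacritized, form of the character in the middle."""
    unmarked = strip_diacritics(marked)
    assert len(unmarked) == len(marked), "assumes a 1-to-1 character alignment"
    padded = "_" * window + unmarked + "_" * window
    for i, label in enumerate(marked):
        context = list(padded[i : i + 2 * window + 1])  # focus char is the middle element
        yield context, label

if __name__ == "__main__":
    # One diacritized word yields one training instance per character.
    for context, label in windowed_instances("résumé"):
        print(context, "->", label)
```

Because the features are nothing but the neighbouring graphemes, this framing needs no dictionary or morphological analyser, which is what makes it attractive for resource-scarce languages.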
“…Character-level features are extracted from the training data, from which models are learned via machine-learning algorithms such as Decision Trees, Instance-based algorithms, and Bayesian classifiers [6,13,18]. Studies on various languages have shown the wide applicability of the character-level model (especially for resource-scarce languages).…”
Section: Letter-Level Restoration
confidence: 99%
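To make the learning step concrete, the sketch below trains one of the classifier families named here (a decision tree, via scikit-learn) on one-hot-encoded character windows and then applies it to unmarked input. The toy corpus, the window size, and the choice of scikit-learn are assumptions made for illustration only; the cited systems use their own toolkits and feature sets.

```python
import unicodedata

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

def strip_diacritics(text: str) -> str:
    """Map marked characters to their unmarked equivalents via NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def instances(marked: str, window: int = 3):
    """Yield (feature_dict, label) pairs: unmarked context -> diacritized character."""
    unmarked = strip_diacritics(marked)
    padded = "_" * window + unmarked + "_" * window
    for i, label in enumerate(marked):
        feats = {f"c{j - window}": padded[i + j] for j in range(2 * window + 1)}
        yield feats, label

# Toy diacritized "corpus" (an assumption; real systems train on large text).
corpus = ["a previsão é de chuva amanhã", "São Paulo é a maior cidade"]
X, y = [], []
for sentence in corpus:
    for feats, label in instances(sentence):
        X.append(feats)
        y.append(label)

vectorizer = DictVectorizer()          # one-hot encodes the categorical context
classifier = DecisionTreeClassifier().fit(vectorizer.fit_transform(X), y)

def restore(unmarked_text: str) -> str:
    """Predict the diacritized form of each character in unmarked input."""
    feats = [f for f, _ in instances(unmarked_text)]
    return "".join(classifier.predict(vectorizer.transform(feats)))

print(restore("a previsao e de chuva"))  # ideally: "a previsão é de chuva"
```

Swapping in an instance-based (k-NN) or Bayesian learner only changes the classifier line; the windowed feature extraction stays the same, which is why the approach transfers easily across languages.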
“…They use a generative statistical model for this purpose. De Pauw et al. (2007) also test their MBL (memory-based learning) model on different languages. Although they do not test on Turkish, the most striking part of their results is that performance for highly inflectional languages is sharply worse than for the others.…”
Section: Related Work
confidence: 99%
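The cross-language comparison referred to here amounts to measuring restoration accuracy per language on held-out, fully diacritized text. The sketch below shows one hedged way to set up such an evaluation; the sample sentences, the identity baseline, and the `evaluate` helper are placeholders rather than the cited experimental setup.

```python
import unicodedata

def strip_diacritics(text: str) -> str:
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def character_accuracy(gold: str, predicted: str) -> float:
    """Fraction of characters restored to their gold-standard form."""
    pairs = list(zip(gold, predicted))
    return sum(g == p for g, p in pairs) / max(len(pairs), 1)

def evaluate(gold_sentences, restore_fn) -> float:
    """Average character accuracy of restore_fn on a held-out diacritized corpus."""
    scores = [character_accuracy(s, restore_fn(strip_diacritics(s)))
              for s in gold_sentences]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    # Trivial baseline: return the unmarked input unchanged. Running the same
    # evaluation over one corpus per language mirrors the cross-language setup.
    identity = lambda text: text
    sample = ["a previsão é de chuva", "São Paulo"]
    print("identity baseline:", round(evaluate(sample, identity), 3))
```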