Proceedings of the Human Language Technology Conference of the NAACL, Companion Volume: Short Papers on XX - NAACL '06 2006
DOI: 10.3115/1614049.1614062

Arabic preprocessing schemes for statistical machine translation

Abstract: In this paper, we study the effect of different word-level preprocessing decisions for Arabic on SMT quality. Our results show that given large amounts of training data, splitting off only proclitics performs best. However, for small amounts of training data, it is best to apply English-like tokenization using part-of-speech tags, and sophisticated morphological analysis and disambiguation. Moreover, choosing the appropriate preprocessing produces a significant increase in BLEU score if there is a change in ge…

Cited by 115 publications (104 citation statements)
References 11 publications
“…Her results showed that morphological preprocessing helps, but only for smaller corpora. Habash and Sadat (2006) and Sadat and Habash (2006) reached similar conclusions on a much larger set of experiments including multiple preprocessing schemes reflecting different levels of morphological representation and multiple techniques for disambiguation and tokenization. They showed that specific preprocessing decisions can have a positive effect when decoding text with a different genre than that of the training data (in essence another form of data sparsity).…”
Section: Morphology-based Approaches
Citation type: supporting
confidence: 53%
“…Lee et al. (2003) use a look-up table for the various prefixes, stems, and suffixes used in the tokenization process. Habash and Sadat (2006) presented various schemes for tokenizing Arabic text for MT, in addition to the Arabic Treebank tokenization (Maamouri et al., 2004). Diab et al. (2007) presented an SVM-based approach for tokenization.…”
Section: Background and Related Work
Citation type: mentioning
confidence: 99%
“…However, especially for morphologically complex languages, using sub-lexical units obtained after morphological preprocessing has been shown to improve the machine translation performance over a word-based system (Popović and Ney, 2004; Habash and Sadat, 2006). For any language, several word tokenization choices, henceforth tokenization schemes, can be generated based on the word's in-context morphological analysis.…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…experiments [5], we used the MADA tool [12] for morphological disambiguation and applied the D2-scheme of [13] for word segmentation. These tools require the input to be Buckwalter encoded.…”
Section: Adjustment of ASR and SMT Vocabularies
Citation type: mentioning
confidence: 99%