Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics - ACL '05 2005
DOI: 10.3115/1219840.1219911

Arabic tokenization, part-of-speech tagging and morphological disambiguation in one fell swoop

Abstract: We present an approach to using a morphological analyzer for tokenizing and morphologically tagging (including part-of-speech tagging) Arabic words in one process. We learn classifiers for individual morphological features, as well as ways of using these classifiers to choose among entries from the output of the analyzer. We obtain accuracy rates on all tasks in the high nineties.

Citations: cited by 253 publications (214 citation statements)
References: 6 publications
“…MADAMIRA (Pasha et al, 2014); the current state-of-the-art tool for Arabic morphological analysis and disambiguation, obtains the disambiguated morphological analysis of the word, and feeds it to a tokenization engine. MADAMIRA utilizes MADA (Habash and Rambow, 2005; Roth et al, 2008) for morphological disambiguation. The top morphological analysis is then used for tokenization deterministically through one of the tokenization schemes.…”
Section: Background and Related Work
type: mentioning
confidence: 99%
“…For the alignment task, the data was tokenized and lowercased for English, and transliterated and segmented using MADA [2] for Arabic. Table 2 shows the correspondences between the one of the seven English connective "while" and Arabic translations detected automatically using the annotation projection from English sentences to Arabic ones.…”
Section: Towards a Multilingual Act Metric
type: mentioning
confidence: 99%
“…The raw Arabic text is enriched and tokenized using the Morphological Analysis and Disambiguation for Arabic (MADA) toolkit (Habash and Rambow, 2005). The various Arabic tokenization schemes that we experiment with, span a segmentation spectrum ranging from coarse segmentation, which uses unsegmented text, to fine segmentation which splits off all possible clitics.…”
Section: Arabic Preprocessing Schemes
type: mentioning
confidence: 99%
“…enriched and the different tokenization generated using the Morphological Analysis and Disambiguation for Arabic (MADA) toolkit (Habash and Rambow, 2005). The parallel training corpora was then filtered by first removing sentence pairs longer than 99 on either side then deleting unbalanced sentence pairs with ratio more than a 4-to-1 in either direction.…”
Section: Training Data
type: mentioning
confidence: 99%