Proceedings of the 2007 Workshop on Computational Approaches to Semitic Languages Common Issues and Resources - Semitic '07 2007
DOI: 10.3115/1654576.1654588

Arabic tokenization system

Abstract: Tokenization is a necessary and non-trivial step in natural language processing. In the case of Arabic, where a single word can comprise up to four independent tokens, morphological knowledge needs to be incorporated into the tokenizer. In this paper we describe a rule-based tokenizer that handles tokenization as a full-rounded process with a preprocessing stage (white space normalizer), and a post-processing stage (token filter). We also show how it handles multiword expressions, and how ambiguity is resolved.
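The three-stage pipeline the abstract describes (preprocessing, tokenization, post-processing) can be sketched roughly as follows; the function names, regexes, and filtering rules here are illustrative assumptions, not the paper's actual implementation:

```python
import re

def normalize_whitespace(text: str) -> str:
    """Preprocessing stage (white space normalizer): collapse runs of
    spaces, tabs, and newlines into a single space."""
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text: str) -> list:
    """Core stage (simplified stand-in): split on whitespace while
    separating punctuation marks into their own tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def token_filter(tokens: list) -> list:
    """Post-processing stage (token filter): drop empty tokens; a real
    filter would apply the paper's own cleanup rules."""
    return [t for t in tokens if t]

def run_pipeline(text: str) -> list:
    return token_filter(tokenize(normalize_whitespace(text)))

# Arabic input with irregular spacing; the Arabic comma becomes its own token.
print(run_pipeline("ذهب   الولد،  ثم  عاد."))
```

Note that `\w` in Python 3 is Unicode-aware by default, so Arabic letters are matched without extra flags.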


Cited by 71 publications (32 citation statements)
References 12 publications
“…Tokenization splits sentences into tokens, which can then be passed to a POS tagger or a morphological analyzer for further processing (Attia, 2007).…”
Section: Corpus and Pre-processing
confidence: 99%
“…Punctuation marks are among the most useful features for detecting the boundaries of sentences and tokens. The number of punctuation marks and symbols used in the Arabic corpus is 134 [31]. There are several methods of implementing tokenization; the simplest, which we used, is to extract any alphanumeric string between two white spaces.…”
Section: Methods
confidence: 99%
“…An Arabic token may consist of several lexical items, each with its own meaning and POS. For example, according to Attia (2007), an Arabic verb can comprise up to four clitics: a conjunction, a tense particle, a stem with affixes, and an object pronoun, as shown in the following example.…”
Section: The Tokenizer Module
confidence: 99%
“…Compared to what has been done in English and other languages, only one approach has been investigated for Arabic shallow parsing (Diab et al., 2004; 2007). Diab et al. (2004) performed tokenization and POS tagging and used an SVM-based approach for Arabic text chunking.…”
Section: Related Work
confidence: 99%