Proceedings of the Fifth Workshop on South and Southeast Asian Natural Language Processing 2014
DOI: 10.3115/v1/w14-5505
English to Urdu Statistical Machine Translation: Establishing a Baseline

Abstract: The aim of this paper is to catalogue the existing resources for English-to-Urdu machine translation (MT) and to establish an empirical baseline for this task. By doing so, we hope to set up a common ground for MT research with Urdu to allow for congruent progress in this field. We build baseline phrase-based MT (PBMT) and hierarchical MT systems and report the results on 3 official independent test sets. On all test sets, hierarchical MT significantly outperformed PBMT. The highest single-ref…

Cited by 14 publications (14 citation statements). References 11 publications.
“…To create our proposed dictionary, we collected data from two different sources: (1) the Urdu Mono-lingual (UrMono) Corpus (https://lindat.mff.cuni.cz/repository/xmlui/handle/ 11858/00-097C-0000-0023-65A9-5, which was last visited on 14 March 2023) by [23] and (2) the Urdu Wikipedia dump (http://wikipedia.c3sl.ufpr.br/urwiki/20191120/, accessed on 14 March 2023). In the first step, the source text from the UrMono corpus was pre-processed (we identified and removed stop words, URLs, digits, and English alphabets) and tokenized (tokenization was done with the help of the Urdu word tokenizer developed by [24]).…”
Section: Data Source
confidence: 99%
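The cleaning steps described in the excerpt above (removing stop words, URLs, digits, and English letters, then tokenizing) can be sketched as follows. This is a minimal illustration, not the cited pipeline: the stop-word list here is a tiny placeholder, and a plain whitespace split stands in for the Urdu word tokenizer of [24].

```python
import re

# Hypothetical minimal stop-word set; the cited work uses a full Urdu list.
STOP_WORDS = {"اور", "کا", "کی", "کے", "ہے", "میں"}

def preprocess(text):
    """Sketch of the cleaning steps described above: strip URLs,
    digits, and English letters, then drop stop words."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[0-9]+", " ", text)                 # remove digits
    text = re.sub(r"[A-Za-z]+", " ", text)              # remove English letters
    # Placeholder tokenization: whitespace split stands in for the
    # Urdu word tokenizer developed by [24], not reproduced here.
    tokens = text.split()
    return [t for t in tokens if t not in STOP_WORDS]
```

In practice the tokenization step matters most for Urdu, since word boundaries are not reliably marked by whitespace; the whitespace split above is only a stand-in.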
“…Complex words dictionary: To address the space insertion problem, a large complex words dictionary was created using the UMC Urdu data set (Jawaid, Kamran, and Bojar 2014), which contains data from various domains including Sports, Politics, Blogs, Education, Literature, Entertainment, Science, Technology, Commerce, Health, Law, Business, Showbiz, Fiction, and Weather. From each domain, at least 1000 sentences were randomly selected and preprocessed to remove noise (see Section 5.4).…”
Section: Urdu Word Tokenizer 4.1.1 Generating Supporting Resources for ...
confidence: 99%
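The sampling step described above (at least 1000 randomly selected sentences per domain) can be sketched as below. The function name, the `corpus` shape (a mapping from domain name to sentence list), and the seed are assumptions for illustration; the cited work's actual selection code is not available.

```python
import random

def sample_per_domain(corpus, n=1000, seed=0):
    """Randomly pick up to n sentences from each domain, as described
    for building the complex-words dictionary. `corpus` maps a domain
    name (e.g. "Sports", "Politics") to its list of sentences."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    selected = {}
    for domain, sentences in corpus.items():
        k = min(n, len(sentences))  # guard against domains smaller than n
        selected[domain] = rng.sample(sentences, k)
    return selected
```

Sampling without replacement (`random.sample`) keeps each selected sentence distinct, which matters when the goal is dictionary coverage rather than frequency estimation.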
“…The TBL approach derives rules automatically from the corpus, achieves higher accuracy than other approaches, and offers more advantages than any other approach used for tagging systems. In [23], a new Urdu POS tag set design schema is presented. The overall system accuracy is reported as 96.8%.…”
Section: Urdu POS Tagging
confidence: 99%
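The transformation-based learning (TBL, i.e. Brill-style) tagging mentioned above can be illustrated with the rule-application half of the algorithm: assign each word its most frequent tag, then rewrite tags with learned context rules. The lexicon, rule format `(from_tag, to_tag, prev_tag)`, and function name below are illustrative assumptions, not the system of [23], and the rule-learning phase (which TBL derives from the corpus) is not shown.

```python
def apply_tbl_rules(tokens, lexicon, rules, default_tag="NN"):
    """Brill-style tagging sketch: initial most-frequent-tag assignment
    from `lexicon`, then ordered transformation rules of the form
    (from_tag, to_tag, prev_tag): change from_tag to to_tag when the
    previous token's tag is prev_tag."""
    tags = [lexicon.get(tok, default_tag) for tok in tokens]
    for from_tag, to_tag, prev_tag in rules:
        for i in range(1, len(tags)):
            if tags[i] == from_tag and tags[i - 1] == prev_tag:
                tags[i] = to_tag
    return list(zip(tokens, tags))
```

The point of TBL is that the rule list itself is induced automatically by repeatedly picking the transformation that most reduces tagging errors on a training corpus; the snippet above only shows how such rules are applied once learned.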