Urdu language processing: a survey

Daud, Ali; Khan, Wahab; Che, Dunren

doi:10.1007/s10462-016-9482-x

Cited by 138 publications (72 citation statements)

References 40 publications (71 reference statements)

“…The study of such techniques expanded rapidly as they were adopted in a wide range of systems [1].…”

Section: Introduction (mentioning)

confidence: 99%

“…The study of such techniques expanded rapidly as they were adopted in a wide range of systems [1]. Although NER frameworks have been proposed for non-European languages, such as Arabic, Persian, and South Asian languages [3], Urdu systems are still in the early stages of development [1,4]. Such systems have achieved a mature status and are able to provide effective results.…”

Section: Introduction (mentioning)

confidence: 99%

“…NER techniques were initially studied in Message Understanding Conferences that were initiated by the Defense Advanced Research Projects Agency to aid in the development of information extraction techniques. The study of such techniques expanded rapidly as they were adopted in a wide range of systems [1]. Early systems mainly supported European languages [2], including English.…”

Section: Introduction (mentioning)

confidence: 99%

“…Such systems have achieved a mature status and are able to provide effective results. Although NER frameworks have been proposed for non-European languages, such as Arabic, Persian, and South Asian languages [3], Urdu systems are still in the early stages of development [1,4]. Therefore, NER for Urdu is a difficult task requiring greater sophistication in linguistic analysis and the development of techniques for effective task performance.…”

Section: Introduction (mentioning)

confidence: 99%

Deep recurrent neural networks with word embeddings for Urdu named entity recognition

et al. 2019

Self Cite

Named entity recognition (NER) continues to be an important task in natural language processing because it is featured as a subtask and/or subproblem in information extraction and machine translation. In Urdu language processing, it is a very difficult task. This paper proposes various deep recurrent neural network (DRNN) learning models with word embedding. Experimental results demonstrate that they improve upon current state‐of‐the‐art NER approaches for Urdu. The DRNN models evaluated include forward and bidirectional extensions of the long short‐term memory and back propagation through time approaches. The proposed models consider both language‐dependent features, such as part‐of‐speech tags, and language‐independent features, such as the “context windows” of words. The effectiveness of the DRNN models with word embedding for NER in Urdu is demonstrated using three datasets. The results reveal that the proposed approach significantly outperforms previous conditional random field and artificial neural network approaches. The best f‐measure values achieved on the three benchmark datasets using the proposed deep learning approaches are 81.1%, 79.94%, and 63.21%, respectively.

“…Nearly every information one requests are currently accessible on the internet [6]. English and European languages have mainly dominated the web since its beginning [7]. However, in the past few years, a widespread range of information in the Indian local languages such as Urdu, Hindi, Bengali, Oriya, Tamil, and Telugu have been observed on the internet [6,8].…”

Section: Introduction (mentioning)

confidence: 99%

Unsupervised Machine Learning based Documents Clustering in Urdu

Rahman, Khan, et al. 2018

ICST Transactions on Scalable Information Systems

Self Cite

The volume of data on the web is growing rapidly, due to the proliferation of news sources, content, blogs, and journals. Like other languages, the Urdu language has also observed tremendous growth on the internet. As the volume of data is expanding, information retrieval (IR) is becoming complicated. Document clustering is an unsupervised ML approach, employed to group a huge number of dispersed documents into a small number of significant and consistent clusters, thus providing a base for indexing, IR and browsing mechanisms. Document clustering has a long tradition in English as well as English-like Western languages, but Urdu lags behind in terms of sophisticated natural language processing (NLP) tools and resources for document clustering. Document clustering becomes a challenging task in the Urdu language, which has a rich morphology, particular structure, syntax peculiarities and cursive nature. In this study, we have developed a framework of document clustering and analysed various similarity measures for Urdu documents. We have also checked the effect of stop word removal in the process of Urdu document clustering.

CLEU ‐ A Cross‐language English‐Urdu corpus and benchmark for text reuse experiments

Muneer, Sharjeel, Iqbal, et al. 2018

Asso for Info Science & Tech

Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse corpus for the English-Urdu language pair. The Cross-Language English-Urdu Corpus (CLEU) has source text in English whereas the derived text is in Urdu. It contains in total 3,235 sentence/passage pairs manually tagged into three categories, that is, near copy, paraphrased copy, and independently written. Further, as a second contribution, we evaluate the Translation plus Mono-lingual Analysis method using three sets of experiments on the proposed dataset to highlight its usefulness. Evaluation results (F1 = 0.732 for binary and F1 = 0.552 for ternary classification) indicate that it is harder to detect cross-language real cases of text reuse, especially when the language pairs have unrelated scripts. The corpus is a useful benchmark resource for the future development and assessment of cross-language text reuse detection systems for the English-Urdu language pair.
