2022
DOI: 10.1017/s1351324921000425

UNLT: Urdu Natural Language Toolkit

Abstract: This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language, Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, despite it being widely spoken in different parts of the world with a large amount of digital text being readily available. This study presents the first version of the UNLT (Urdu …

Citations: cited by 15 publications (5 citation statements)
References: 51 publications
“…-Word tokenization: word tokenization is the process of taking a piece of text and breaking it down into individual words, or tokens. In the context of email spam detection, word tokenization can be used to extract individual words from email messages to analyze and identify spam patterns (Shafi et al, 2022). By tokenizing the words in the message, we can analyze the individual words and look for certain patterns, or words, that are commonly associated with spam messages.…”
Section: Data Preprocessing (mentioning)
confidence: 99%
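
The word-tokenization step described in the statement above can be sketched as follows. This is a minimal illustration only, assuming whitespace- and punctuation-delimited text; it is not the UNLT word tokenizer and not the citing paper's pipeline. The `SPAM_WORDS` set and both function names are hypothetical, chosen for the example.

```python
import re
from collections import Counter

# Hypothetical set of spam-associated words, chosen only for illustration.
SPAM_WORDS = {"free", "winner", "prize", "urgent"}


def tokenize(text):
    """Lowercase the text and split it into word tokens on non-word characters."""
    return re.findall(r"\w+", text.lower())


def spam_word_counts(text):
    """Count how often each spam-associated word occurs among the tokens."""
    return Counter(tok for tok in tokenize(text) if tok in SPAM_WORDS)


message = "URGENT: you are a winner! Claim your free prize, free of charge."
tokens = tokenize(message)
counts = spam_word_counts(message)
print(tokens)
print(counts)
```

A regular expression is enough here because the sample text separates words with spaces and punctuation; text without reliable word boundaries would need a dedicated tokenizer.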
“…FudanNLP employs statistics-based and rule-based methods to tackle various NLP tasks, including word segmentation, POS tagging, NER, dependency parsing, anaphora resolution, and time-phrase recognition. For Urdu, a sister language of Pashto, [13] developed the UNLT toolkit, which includes three preliminary NLP tools: word tokenizer, sentence tokenizer, and part-of-speech tagger. The word tokenizer utilizes a morpheme-matching algorithm combined with a stochastic n-gram model.…”
Section: Related Work (mentioning)
confidence: 99%
“…They constructed their model using CRF and ME methods for assigning the POS tag to a word [28]. Over the years, researchers such as Nunsanga MV et al [29], Shafi J et al [30], Singha RK et al [31], Dalai T et al [32] proposed probabilistic models for Mizo, Urdu, Manipuri and Odia language respectively. Transportation of a language model to another language is easier in the case of such probabilistic models.…”
Section: Parts Of Speech Tagging For Indo-Aryan Languages (mentioning)
confidence: 99%