Processing of social media text like tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text due to the noisy nature of such text. However, demand for tools and resources to correctly process such noisy text has increased in recent years due to the usefulness of such text in various applications. Literature reports various efforts made to develop tools and resources to process such noisy text for various languages, notably, part-of-speech (POS) tagging, an NLP task having a direct effect on the performance of other successive text processing activities. Still, no such attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is on POS tagging of Urdu tweets. We introduce a new tagset for POS-tagging of Urdu tweets along with the POS-tagged Urdu tweets corpus. We also investigated bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present a supervised POS tagger with an accuracy of 93.8% precision, 92.9% recall and 93.3% F-measure.
The paper discusses the current state of Sindhi corpus construction in detail. Sindhi corpus development issues including corpus acquisition, preprocessing, and tokenization are discussed in detail. Preliminary results and observations which include letter unigram, bigram and trigram frequencies; word frequencies and word bigram frequencies are presented. Current state of Sindhi corpus with its limitations and future work is also discussed. The paper also explores the orthography and script of Sindhi language with reference to corpus development.
Twitter, a social media platform has experienced substantial growth over the last few years. Thus, huge number of tweets from various communities is available and used for various NLP applications such as Opinion mining, information extraction, sentiment analysis etc. One of the key pre-processing steps in such NLP applications is Part-of-Speech (POS) tagging. POS tagging of Twitter data (also called noisy text) is different than conventional POS tagging due to informal nature and presence of Twitter specific elements. Resources for POS tagging of tweet specific data are mostly available for English. Though, availability of tagset and language independent statistical taggers do provide opportunity for resource-poor languages such as Urdu to expand coverage of NLP tools to this new domain of POS tagging for which little effort has been reported. The aim of this study is twofold. First, is to investigate how well the statistical taggers developed for POS tagging of structured text fare in the domain of tweet POS tagging. Secondly, how can these taggers be used to overcome the bottleneck of manually annotated corpus for this new domain. To this end, Stanford and MorphoDiTa taggers were trained on 500 Urdu tweet gold-standard corpus and were utilized for semi-automatic corpus annotation in bootstrapped fashion. Five bootstrapping iterations for both the taggers were performed. At the end of each iteration, the performance of taggers was evaluated against the development set and automatically tagged, manually corrected 100 tweets were added in the training set to retrain both models. Finally, at the end of last iteration, tagger performance was evaluated against test set. Stanford tagger achieved an accuracy of 93.8% Precision, 92.9% Recall and 93.3% F-Measure. Whereas, MorphoDiTa tagger achieved an accuracy of 93.5% Precision, 92.6% Recall and 93% F-Measure. A thorough error analysis on the output of both taggers is also presented.
We discuss agreeing adverbs in Urdu, Sindhi and Punjabi. We adduce crosslinguistic evidence that is based mainly on similar patterns in Romance and posit that there is a close connection between resultatives and so-called pseudo-resultatives, which the agreeing adverbs appear to instantiate. We propose a diachronic relationship by which the originally predicative part of a resultative is reinterpreted as an adjunct that modifies the overall event predication, not just the result.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.