The rapid dissemination of information on digital platforms has led to the emergence of information pollution such as misinformation, disinformation, fake news, and various forms of propaganda. Information pollution has become a serious threat to the online digital world and poses several challenges to social media platforms and governments around the world. In this paper, we propose Propaganda Spotting in Online Urdu Language (ProSOUL), a framework to identify the content and sources of propaganda spread in the Urdu language. First, we develop a labelled dataset of 11,574 Urdu news articles to train machine learning classifiers. Next, we develop a Linguistic Inquiry and Word Count (LIWC) dictionary to extract psycho-linguistic features from Urdu text. We evaluate the performance of different classifiers by varying n-gram, News Landscape (NELA), Word2Vec, and Bidirectional Encoder Representations from Transformers (BERT) features. Our results show that the combination of NELA, word n-gram, and character n-gram features performs best, with 0.91 accuracy for Urdu text classification. In addition, Word2Vec embeddings outperform BERT features in classifying Urdu text, reaching 0.87 accuracy. Moreover, we build and classify large-scale Urdu content repositories to identify web sources spreading propaganda. Our results show that the ProSOUL framework detects propaganda more effectively in online Urdu news content than in general web content. To the best of our knowledge, this is the first study on the detection of propaganda content in the Urdu language.
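To make the word and character n-gram features mentioned above concrete, the following is a minimal sketch of how such counts could be extracted from a piece of text. It is not the ProSOUL implementation: the function names, the whitespace tokenization, the chosen n-gram sizes, and the `w`/`c` key prefixes are all assumptions made for illustration, and the NELA features and the classifier itself are not shown.

```python
from collections import Counter


def word_ngrams(text, n):
    """Return word n-grams as space-joined strings (whitespace tokenization, an assumption)."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def char_ngrams(text, n):
    """Return character n-grams after collapsing runs of whitespace to one space."""
    normalized = " ".join(text.split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def combined_features(text, word_n=(1, 2), char_n=(2, 3)):
    """Count word and character n-grams in one sparse feature map.

    Keys are prefixed with 'w<n>:' or 'c<n>:' so that a word n-gram and a
    character n-gram with the same surface form stay distinct.
    """
    features = Counter()
    for n in word_n:
        for gram in word_ngrams(text, n):
            features[f"w{n}:{gram}"] += 1
    for n in char_n:
        for gram in char_ngrams(text, n):
            features[f"c{n}:{gram}"] += 1
    return features


# Hypothetical sample sentence, used only to show the call.
sample_features = combined_features("یہ ایک خبر ہے")
```

A feature map like this would typically be converted into a fixed-length vector and concatenated with other feature groups before training a classifier. Whitespace tokenization is a simplification, since Urdu word segmentation generally needs more care.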
Urdu, the most popular language in Pakistan, is spoken by millions of people across the globe. While English is considered the dominant language of web content, the characteristics of Urdu web content are still unknown. In this paper, we study the World Wide Web (WWW) by focusing on content in the Perso-Arabic script. Leveraging the Common Crawl corpus, the largest publicly available collection of web content, with 2.87 billion documents for December 2016, we examine different aspects of Urdu web content. We use the Compact Language Detector (CLD2) for language detection. We find that Urdu web content accounts for 0.04% of the global WWW with respect to document frequency. The top-level domains .com, .org, and .info account for 70.9% of Urdu domains, and urdulughat is the most dominant second-level domain. Of these domains, 40% are hosted in the United States, while only 0.33% are hosted within Pakistan. Moreover, 25.68% of web pages have Urdu as their primary language, and only 11.78% of web pages are exclusively in Urdu. Our Urdu corpus consists of 1.25 billion total tokens and 18.14 million unique tokens. Furthermore, the corpus follows Zipf's law. This Urdu corpus can be used for text summarization, text classification, and cross-lingual information retrieval.
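As an illustration of what checking Zipf's law on a token stream can look like, the sketch below ranks token types by frequency and fits a least-squares line to log(frequency) against log(rank). A slope near -1 is the classic Zipfian signature. This is an assumed procedure for illustration only; the abstract does not say how the fit was performed, and the function names are hypothetical.

```python
import math
from collections import Counter


def rank_frequency(tokens):
    """Return (rank, frequency) pairs, with rank 1 for the most frequent type."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    pairs = rank_frequency(tokens)
    if len(pairs) < 2:
        raise ValueError("need at least two distinct token types to fit a slope")
    xs = [math.log(rank) for rank, _ in pairs]
    ys = [math.log(freq) for _, freq in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

On a real corpus the fit is usually restricted to a range of ranks, since the most and least frequent types tend to deviate from a straight line in log-log space.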