2020
DOI: 10.3233/jifs-179905
“Bend the truth”: Benchmark dataset for fake news detection in Urdu language and its evaluation

Abstract: The paper presents a new corpus for fake news detection in the Urdu language, along with a baseline classification and its evaluation. With the escalating use of the Internet worldwide and the substantially increasing impact of ambiguous information, the challenge of quickly identifying fake news in digital media across languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400 as fake, allowing the …

Cited by 46 publications (58 citation statements)
References 6 publications
“…More recently, Wang et al (2020) collected a Weibo dataset containing 7,300 news articles in Chinese across eight domains (health, economic, technology, entertainment, society, military, political, and education). Similarly, Amjad et al (2020) proposed a benchmark dataset in the Urdu language that contains 900 news articles in 5 different domains (business, health, showbiz, sports, and technology). Therefore, to overcome the limitations on the diversity of fact-checking, future datasets should be collected from a variety of news domains.…”
Section: Survey Methodology
confidence: 99%
“…There is a predominance of works that propose complete systems or methods for detecting disinformative content (n=74; 77.08%), compared with a smaller number of studies on databases (n=11; 11.45%) and specific algorithms (n=9; 9.37%) (Table 5). In the case of databases, most are in English, built to train algorithms used in linguistic analysis models for that language; although there are approaches focused on other languages, such as Portuguese (Silva, Santos, Almeida and Pardo, 2020), Spanish (Posadas-Durán, Gómez-Adorno, Sidorov and Moreno-Escobar, 2019), and minority languages such as Urdu (Amjad et al 2020).…”
Section: Years of Publication, Countries, Proposals and Algorithmic Models
confidence: unclassified
“…To detect fake news, they applied SVM, LR, RF, and boosting to bag-of-words (BOW), POS-tag, and n-gram feature sets of their datasets, and found that character 4-grams, without removing stop words, combined with the boosting algorithm gave the best accuracy. Amjad et al [5] proposed a new Urdu-language corpus for fake news detection, "Bend The Truth", which contains 900 news articles, 500 annotated as real and 400 labeled as fake. Their text-representation feature sets include combinations of word n-grams, character n-grams, and functional word n-grams (n ranging from 1 to 6) with a variety of feature-weighting schemes, including binary values, normalized frequency, log-entropy weighting, raw frequency, relative frequency, and TF-IDF.…”
Section: Literature Review
confidence: 99%
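The pipeline described in the citation statement above — character n-gram features with TF-IDF weighting fed to a boosting classifier — can be sketched in scikit-learn. This is a minimal illustration, not the authors' actual code: the toy texts and labels are placeholders, and the specific estimators (`TfidfVectorizer`, `AdaBoostClassifier`) are assumptions standing in for whatever implementation the papers used.

```python
# Sketch of the cited approach: character 4-gram TF-IDF features plus a
# boosting classifier, with stop words left in (as the best-performing
# configuration reportedly kept them). Toy data only, not "Bend The Truth".
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

texts = [
    "heads of state met to discuss a new trade agreement",
    "miracle cure discovered overnight, doctors are furious",
    "markets closed slightly higher after the announcement",
    "secret plan revealed by an anonymous insider source",
]
labels = [0, 1, 0, 1]  # 0 = real, 1 = fake (placeholder annotations)

pipeline = make_pipeline(
    # char_wb restricts character n-grams to word boundaries;
    # ngram_range=(4, 4) gives character 4-grams with TF-IDF weights.
    TfidfVectorizer(analyzer="char_wb", ngram_range=(4, 4)),
    AdaBoostClassifier(n_estimators=50),
)
pipeline.fit(texts, labels)
print(pipeline.predict(["markets closed slightly higher after the announcement"]))
```

Swapping the vectorizer's `analyzer` and `ngram_range`, or replacing `AdaBoostClassifier` with SVM/LR/RF, reproduces the feature/classifier grid the cited comparison walks through.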