The paper presents a new corpus for fake news detection in the Urdu language, along with baseline classification results and their evaluation. With Internet use escalating worldwide and the impact of readily available ambiguous information growing substantially, the challenge of quickly identifying fake news in digital media in various languages becomes more acute. We provide a manually assembled and verified dataset containing 900 news articles, 500 annotated as real and 400 as fake, allowing the investigation of automated fake news detection approaches in Urdu. The news articles in the truthful subset come from legitimate news sources, and their validity has been manually verified. For the fake subset, the known difficulty of finding fake news was addressed by hiring professional journalists who are native speakers of Urdu and who were instructed to intentionally write deceptive news articles. The dataset covers 5 different topics: (i) Business, (ii) Health, (iii) Showbiz, (iv) Sports, and (v) Technology. To establish our Urdu dataset as a benchmark, we performed baseline classification. We crafted a variety of text representation feature sets, including word n-grams, character n-grams, functional word n-grams, and their combinations. After applying a variety of feature weighting schemes, we ran a series of classifiers on the train-test split. The results show sizable performance gains by the AdaBoost classifier, with 0.87 F1Fake and 0.90 F1Real. We report the results under different metrics for convenient comparison with future research. The dataset is publicly available for research purposes.
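As an illustration of the feature sets and the per-class scores mentioned in this abstract, the following is a minimal sketch in plain Python. The function names, the whitespace tokenization, and the label strings `"fake"` and `"real"` are assumptions made for the example; the abstract does not describe the authors' actual implementation.

```python
from collections import Counter

def word_ngrams(text, n):
    """Count word n-grams; splitting on whitespace is an assumption."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def char_ngrams(text, n):
    """Count character n-grams over the raw string (spaces included)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def f1_for_class(y_true, y_pred, label):
    """F1 score treating `label` as the positive class (e.g. F1Fake or F1Real)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero or undefined
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical gold labels and predictions, for illustration only.
y_true = ["fake", "fake", "real", "real"]
y_pred = ["fake", "real", "real", "real"]
f1_fake = f1_for_class(y_true, y_pred, "fake")
f1_real = f1_for_class(y_true, y_pred, "real")
```

Reporting one F1 value per class, as the abstract does, shows how well each class is detected separately, which a single averaged score would hide when the classes are imbalanced (here 500 real versus 400 fake).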
Supervised learning algorithms are a recent trend in the prediction of mechanical properties of concrete. This paper presents AdaBoost, random forest (RF), and decision tree (DT) models for predicting the compressive strength of concrete at high temperature, based on the experimental data of 207 tests. The cement content, water, fine and coarse aggregates, silica fume, nano silica, fly ash, super plasticizer, and temperature were used as inputs for the models’ development. The performance of the AdaBoost, RF, and DT models is assessed using statistical indices, including the coefficient of determination (R2), root mean squared error-observations standard deviation ratio (RSR), mean absolute percentage error, and relative root mean square error. The three models are compared with each other, and also with the artificial neural network and adaptive neuro-fuzzy inference system models described in the literature, to demonstrate the suitability of supervised learning methods for predicting the compressive strength at high temperature. The results indicated a strong correlation between experimental and predicted values, with R2 above 0.9 and RSR lower than 0.5 during the learning and testing phases for the AdaBoost model. Moreover, sensitivity analysis revealed the cement content in the mix as the most sensitive parameter.
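The four statistical indices named in this abstract can be sketched as follows. The formulas are common textbook definitions and are assumptions here: the abstract does not state which variants the authors used (for instance, some papers define R2 as the squared Pearson correlation, or report RRMSE as a percentage). Function names and sample values are hypothetical.

```python
import math

def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rsr(observed, predicted):
    """RMSE divided by the standard deviation of the observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return math.sqrt(ss_res) / math.sqrt(ss_tot)

def mape(observed, predicted):
    """Mean absolute percentage error, in percent; observations must be nonzero."""
    n = len(observed)
    return 100.0 / n * sum(abs((o - p) / o) for o, p in zip(observed, predicted))

def rrmse(observed, predicted):
    """RMSE relative to the mean observation (as a fraction, not a percentage)."""
    n = len(observed)
    rmse = math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)
    return rmse / (sum(observed) / n)

# Hypothetical strengths in MPa, for illustration only.
observed = [10.0, 20.0, 30.0]
predicted = [12.0, 18.0, 30.0]
scores = {
    "R2": r_squared(observed, predicted),
    "RSR": rsr(observed, predicted),
    "MAPE": mape(observed, predicted),
    "RRMSE": rrmse(observed, predicted),
}
```

Under these particular definitions, RSR squared equals 1 minus R2, so the two indices carry the same information on a given dataset; that relation does not hold if R2 is computed as a squared correlation.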
Automatic threatening language detection is an important task, and most existing studies have focused on English. Threatening language detection in low-resource languages, however, has received little attention. In this paper, we introduce a new publicly available dataset for threatening language detection in Urdu tweets to fill this gap for the Urdu language. The proposed dataset contains 3,564 tweets manually annotated by human experts with two labels: threatening and non-threatening. The threatening tweets are further classified into two classes: threatening to an individual person or threatening to a group. This research follows a two-step approach: (i) classify a given tweet as threatening or non-threatening and (ii) classify whether a threatening tweet is used to threaten an individual or a group. We compare three forms of text representation: two count-based representations, where the text is represented using either character n-gram counts or word n-gram counts as feature vectors, and a third based on fastText pre-trained word embeddings for Urdu. We perform several experiments using machine learning and deep learning classifiers, and our study shows that the MLP classifier with the combination of word n-gram features outperformed other classifiers in detecting threatening tweets, whereas the SVM using fastText pre-trained word embeddings obtained the best results for the target identification task.
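The two-step approach described in this abstract can be read as a simple cascade: the second classifier is consulted only when the first one flags a tweet as threatening. The sketch below shows only that control flow. The function names are hypothetical, and the two toy keyword rules are placeholders standing in for trained models; they are not the MLP or SVM classifiers from the paper.

```python
def two_step_classify(tweet, is_threatening, target_of):
    """Step (i): threatening or not; step (ii): individual or group target."""
    if not is_threatening(tweet):
        return ("non-threatening", None)  # step (ii) is skipped entirely
    return ("threatening", target_of(tweet))

# Toy stand-ins (hypothetical keyword rules, NOT trained classifiers).
def toy_is_threatening(tweet):
    return "hurt" in tweet.lower()

def toy_target_of(tweet):
    words = tweet.lower().split()
    return "group" if "they" in words or "them" in words else "individual"

result_a = two_step_classify("Nice weather today", toy_is_threatening, toy_target_of)
result_b = two_step_classify("I will hurt them", toy_is_threatening, toy_target_of)
result_c = two_step_classify("I will hurt you", toy_is_threatening, toy_target_of)
```

Passing the two classifiers in as arguments keeps the cascade independent of the model choice, which matches the abstract's finding that different models performed best at each step.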
The major criterion that controls pile foundation design is the pile bearing capacity (Pu). The load bearing capacity of piles is affected by the various characteristics of soils and involves multiple parameters related to both soil and foundation. In this study, a new model for predicting bearing capacity is developed using an extreme gradient boosting (XGBoost) algorithm. A total of 200 case histories based on static load tests of driven piles were used to construct and verify the model. The results of the developed XGBoost model were compared with those of a number of commonly used algorithms, namely Adaptive Boosting (AdaBoost), Random Forest (RF), Decision Tree (DT), and Support Vector Machine (SVM), using various performance metrics such as coefficient of determination, mean absolute error, root mean square error, mean absolute relative error, Nash–Sutcliffe model efficiency coefficient, and relative strength ratio. Furthermore, sensitivity analysis was performed to determine the effect of input parameters on Pu. The results show that all of the developed models were capable of making accurate predictions; however, the XGBoost algorithm surpasses the others, followed by AdaBoost, RF, DT, and SVM. The sensitivity analysis shows that the SPT blow count along the pile shaft has the greatest effect on Pu.
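Three of the performance measures listed in this abstract (mean absolute error, mean absolute relative error, and the Nash–Sutcliffe efficiency) can be sketched as follows. These are common definitions and are assumed here, since the abstract does not give formulas; the function names and the sample capacities are hypothetical.

```python
def mean_absolute_error(observed, predicted):
    """Average absolute difference, in the units of the observations."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)

def mean_absolute_relative_error(observed, predicted):
    """Average of |error| / |observation|; observations must be nonzero."""
    return sum(abs(o - p) / abs(o) for o, p in zip(observed, predicted)) / len(observed)

def nash_sutcliffe(observed, predicted):
    """1 minus the ratio of squared error to the variance of the observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Hypothetical bearing capacities in kN, for illustration only.
observed = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
mae = mean_absolute_error(observed, predicted)
mare = mean_absolute_relative_error(observed, predicted)
nse = nash_sutcliffe(observed, predicted)
```

A Nash–Sutcliffe value of 1 indicates a perfect match, while a value at or below 0 means the model predicts no better than the mean of the observations, which makes it a convenient scale for ranking several models on the same data.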
This study reports on the second shared task, named UrduFake@Fire2021, on fake news detection in the Urdu language. This is a binary classification problem in which the task is to classify a given news article into one of two classes: (i) real news or (ii) fake news. In this shared task, 34 teams from 7 different countries (China, Egypt, Israel, India, Mexico, Pakistan, and UAE) registered to participate, 18 teams submitted their experimental results, and 11 teams submitted their technical reports. The proposed systems were based on various count-based features and used different classifiers as well as neural network architectures. The stochastic gradient descent (SGD) algorithm outperformed other classifiers and achieved an F-score of 0.679.