Wikipedia is an online encyclopedia that anyone can edit. While most edits are constructive, about 7% are acts of vandalism. Such behavior is characterized by modifications made in bad faith, such as introducing spam and other inappropriate content. In this work, we present the results of an effort to integrate three of the leading approaches to Wikipedia vandalism detection: a spatio-temporal analysis of metadata (STiki), a reputation-based system (WikiTrust), and natural language processing features. The resulting joint system improves on the state of the art set by all previous methods and establishes a new baseline for Wikipedia vandalism detection. We examine in detail the contribution of the three approaches, both for the task of discovering fresh vandalism and for the task of locating vandalism in the complete set of Wikipedia revisions. Authors appear alphabetically; order does not reflect contribution magnitude.
Abstract

Wikipedia is an online encyclopedia that anyone can edit. The fact that there are almost no restrictions on contributing content is at the core of its success. However, it also attracts pranksters, lobbyists, spammers, and other people who degrade Wikipedia's contents. One of the most frequent kinds of damage is vandalism, which is defined as any bad-faith attempt to damage Wikipedia's integrity.

For some years, the Wikipedia community has been fighting vandalism using automatic detection systems. In this work, we develop one such system, which won the 1st International Competition on Wikipedia Vandalism Detection. This system consists of a feature set exploiting the textual content of Wikipedia articles. We performed a study of different supervised classification algorithms for this task, concluding that ensemble methods such as Random Forest and LogitBoost are clearly superior.

After that, we combine this system with two other leading approaches based on different kinds of features: metadata analysis and reputation. This joint system obtains one of the best results reported in the literature. We also conclude that our approach is mostly language independent, so we can adapt it to languages other than English with minor changes.

Summary

Wikipedia is an online encyclopedia that anyone can edit. The fact that there are almost no restrictions on contributing content is at the heart of its success. However, this also attracts pranksters, lobbyists, spammers, and other people who degrade Wikipedia's contents.

One of the most frequent kinds of damage is vandalism, defined as any bad-faith attempt to damage Wikipedia's integrity.

For some years, the Wikipedia community has been fighting vandalism using automatic detection systems.

In this work, we develop one of these systems, which won the First Competition [...]

After that, we combine this system with two other leading approaches based on different kinds of features: metadata analysis and reputation. This joint system obtains one of the best results published in the literature. We also conclude that our approach is mostly language independent, so we can adapt it to languages other than English with minor changes.
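To make the idea of a joint feature set more concrete, the following is a minimal sketch, not the authors' implementation. The abstracts do not list the actual features, so every feature here (`size_ratio`, `upper_ratio`, `longest_char_run`, `vulgar_inserted`), the tiny `VULGAR_WORDS` lexicon, and the metadata and reputation values in the usage note are hypothetical illustrations. It only shows how text-derived features for one edit might be computed and merged with metadata and reputation features into a single record that a supervised classifier could consume.

```python
import re

# Hypothetical word list for illustration; a real system would need a far larger lexicon.
VULGAR_WORDS = {"stupid", "idiot", "sucks"}


def text_features(old_text: str, new_text: str) -> dict:
    """Compute a few illustrative textual features for one edit (assumed, not from the source)."""
    old_words = set(re.findall(r"\w+", old_text.lower()))
    new_words = re.findall(r"\w+", new_text.lower())
    # Words present after the edit but absent before it (a crude stand-in for a real diff).
    inserted = [w for w in new_words if w not in old_words]

    letters = [c for c in new_text if c.isalpha()]
    upper_ratio = sum(c.isupper() for c in letters) / len(letters) if letters else 0.0

    # Longest run of a single repeated character, e.g. "loooool" contains a run of 5.
    longest_run = max(
        (len(m.group(0)) for m in re.finditer(r"(.)\1*", new_text)), default=0
    )

    return {
        "size_ratio": (1 + len(new_text)) / (1 + len(old_text)),
        "upper_ratio": upper_ratio,
        "longest_char_run": float(longest_run),
        "vulgar_inserted": float(sum(w in VULGAR_WORDS for w in inserted)),
    }


def joint_features(text: dict, metadata: dict, reputation: dict) -> dict:
    """Merge the three feature groups into one flat dictionary with group prefixes."""
    joint = {}
    for prefix, group in (("text", text), ("meta", metadata), ("rep", reputation)):
        for name, value in group.items():
            joint[f"{prefix}.{name}"] = value
    return joint
```

A call such as `joint_features(text_features(old, new), {"is_anonymous": 1.0}, {"author_reputation": 0.2})` yields one flat record per edit. The prefixes keep the three groups distinguishable, which would make it straightforward to train a classifier on each group separately and on their union when comparing their contributions.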