We present an overview of the CLEF-2019 Lab ProtestNews on Extracting Protests from News in the context of generalizable natural language processing. The lab consists of document, sentence, and token level information classification and extraction tasks that were referred as task 1, task 2, and task 3 respectively in the scope of this lab. The tasks required the participants to identify protest relevant information from English local news at one or more aforementioned levels in a cross-context setting, which is cross-country in the scope of this lab. The training and development data were collected from India and test data was collected from India and China. The lab attracted 58 teams to participate in the lab. 12 and 9 of these teams submitted results and working notes respectively. We have observed neural networks yield the best results and the performance drops significantly for majority of the submissions in the cross-country setting, which is China.
We propose a coherent set of tasks for protest information collection in the context of generalizable natural language processing. The tasks are news article classification, event sentence detection, and event extraction. Having tools for collecting event information from data produced in multiple countries enables comparative sociology and politics studies. We have annotated news articles in English from a source and a target country in order to be able to measure the performance of the tools developed using data from one country on data from a different country. Our preliminary experiments have shown that the performance of the tools developed using English texts from India drops to a level that are not usable when they are applied on English texts from China. We think our setting addresses the challenge of building generalizable NLP tools that perform well independent of the source of the text and will accelerate progress in line of developing generalizable NLP systems.
What is the most optimal way of creating a gold standard corpus for training a machine learning system that is designed for automatically collecting protest information in a cross-country context? We show that creating a gold standard corpus for training and testing machine learning models on the basis of randomly chosen news articles from news archives yields better performance than selecting news articles on the basis of keyword filtering, which is the most prevalent method currently used in automated event coding. We advance this new bottom-up approach to ensure generalizability and reliability in cross-country comparative protest event collection from international and local news in different countries, languages, sources and time periods, which entails a large variety of event types, actors, and targets. We present the results of comparing our random-sample approach with keyword filtering. We show that the machine learning algorithms, and particularly state-of-the-art deep learning tools, perform much better when they are trained with the gold standard corpus from a randomly selected set of news articles from China, India, and South Africa. Finally, we also present our approach to overcome the major ethical issues that are intrinsic to protest event coding.
We describe a gold standard corpus of protest events that comprise various local and international English language sources from various countries. The corpus contains document-, sentence-, and token-level annotations. This corpus facilitates creating machine learning models that automatically classify news articles and extract protest event-related information, constructing knowledge bases that enable comparative social and political science studies. For each news source, the annotation starts with random samples of news articles and continues with samples drawn using active learning. Each batch of samples is annotated by two social and political scientists, adjudicated by an annotation supervisor, and improved by identifying annotation errors semi-automatically. We found that the corpus possesses the variety and quality that are necessary to develop and benchmark text classification and event extraction systems in a cross-context setting, contributing to the generalizability and robustness of automated text processing systems. This corpus and the reported results will establish a common foundation in automated protest event collection studies, which is currently lacking in the literature.
No abstract
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.