Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020 (2020)
DOI: 10.1145/3383583.3398567

The POLUSA Dataset: 0.9M Political News Articles Balanced by Time and Outlet Popularity

Abstract: News articles covering policy issues are an essential source of information in the social sciences and are also frequently used for other use cases, e.g., to train NLP language models. To derive meaningful insights from the analysis of news, large datasets are required that represent real-world distributions, e.g., with respect to the contained outlets' popularity, topically, or across time. Information on the political leanings of media publishers is often needed, e.g., to study differences in news reporting …

Citations: cited by 14 publications (11 citation statements)
References: 8 publications
“…NoConflict Team NoConflict used their model of protest event sentence classification from the winning submission of the English version of Task 1 Subtask 2. Their model is based on a RoBERTa (Liu et al, 2019) backbone with a second pretraining (Gururangan et al, 2020) stage done on the POLUSA (Gebhard and Hamborg, 2020) data set before finetuned on Subtask 2 data. For the NYT data set, they first filtered the articles based on the section name.…”
Section: Team Systems (citation type: mentioning)
confidence: 99%
“…Second Pretraining We start by conducting an additional round of pretraining of RoBERTa, initialized with the already pretrained weight, following Gururangan et al (2020). To this end, we pretrain on the POLUSA dataset (Gebhard and Hamborg, 2020) in an MLM setting with a masking probability of 0.15. We denote this pretraining step as Second Pretraining.…”
Section: Proposed Methods (citation type: mentioning)
confidence: 99%
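The statement above describes masked language model (MLM) pretraining with a masking probability of 0.15. As a rough illustration of what that setting means, here is a minimal sketch in plain Python. It is not the citing authors' code: the function name `mask_tokens`, the `<mask>` placeholder string, and the toy vocabulary are hypothetical, and the 80/10/10 split between mask token, random token, and unchanged token is the common BERT-style convention, assumed here because the statement only gives the 0.15 probability.

```python
import random

MASK = "<mask>"  # placeholder mask string (assumption, not taken from the source)


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Build one MLM training example.

    Returns (inputs, labels). labels[i] holds the original token at positions
    selected for prediction and None everywhere else.
    """
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            # Position selected for prediction: remember the original token.
            labels.append(tok)
            roll = rng.random()
            if roll < 0.8:
                inputs.append(MASK)               # 80%: replace with mask token
            elif roll < 0.9:
                inputs.append(rng.choice(vocab))  # 10%: replace with random token
            else:
                inputs.append(tok)                # 10%: keep the token unchanged
        else:
            # Position not selected: input is unchanged and carries no label.
            inputs.append(tok)
            labels.append(None)
    return inputs, labels


if __name__ == "__main__":
    sentence = "the senate passed the budget bill on tuesday".split()
    toy_vocab = sorted(set(sentence))
    masked, targets = mask_tokens(sentence, toy_vocab, seed=1)
    print(masked)
    print(targets)
```

In practice, a library data collator would typically handle this step on token IDs rather than strings; the sketch only shows how the 0.15 probability determines which positions the model is asked to reconstruct.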
“…Due to time and resource constraint, however, we cannot gain access to articles from these outlets. Thus, we resort to POLUSA (Gebhard and Hamborg, 2020). While it is not from the same outlets, the fact that it only contains political news make it suitable for our purpose.…”
Section: A Appendix (citation type: mentioning)
confidence: 99%
“…For research and evaluation of the previously described system and its analysis methods, I currently use the datasets AllSides (Chen et al, 2018), NewsWCL50 (Hamborg et al, 2019c), and POLUSA (Gebhard and Hamborg, 2020), which have high diversity concerning outlets' political slant.…”
Section: System and Visualization (citation type: mentioning)
confidence: 99%