2022
DOI: 10.1177/23780231221135523

From Documents to Data: A Framework for Total Corpus Quality

Abstract: As large corpora of digitized text become increasingly available, researchers are rediscovering textual data’s potential fruitfulness for inquiries into social and cultural phenomena. Although textual corpora promise to enrich our knowledge of the social world, avoiding problems related to data quality remains a challenge to related empirical research. Hence, evaluating the quality of a corpus will be pivotal for future social scientific inquiries. The authors propose a conceptual framework for total corpus qu…

Cited by 10 publications (9 citation statements)
References 93 publications

“…The chapter suggests that as the toolbox of ML approaches expands, so will the need for methodological reflection on the datasets and algorithms used, analyzed, and interpreted. Research needs to consider the production, curation, and limitation of each dataset by considering each corpus as a product of social practices and decisions (Mützel, 2015a), and thus of a certain data quality (Hurtado Bodell et al, 2022). For sociological research, the growing toolbox also contains data sources other than text, like sounds and images.…”
Section: Conclusion: Generating Theoretical Insights and Methodologic...
Citation type: mentioning
confidence: 99%
“…The field has begun to systematically develop such guidelines and standards for ML research methods in general (Kapoor et al, 2023) and, separately, for data quality. Hurtado Bodell, Magnusson, & Mützel (2022) propose a framework for assessing total corpus quality that identifies all stages-study design, data collection, processing, and, finally, analysis-at which potential errors in working with a digitized dataset can occur. They use digitized Swedish newspapers to demonstrate the framework yet underscore its application to other projects that use sound, images, or digital data.…”
Section: Developing a Culture Of Data
Citation type: mentioning
confidence: 99%
“…Assessing the quality of curating work is pivotal (Hurtado Bodell et al, 2022). We evaluated the quality of the annotated Courier article corpus in two dimensions: the OCR quality, and the quality of our process of annotating and compiling the Courier articles.…”
Section: (42) Quality Assurance
Citation type: mentioning
confidence: 99%
“…The second part details our exploratory work at KBLab, the library's data lab for digital research [8], in testing AI models to curate the digitized newspaper material. We have two key aims with such an account: providing orientation for researchers interested in using KB's newspaper data; and contributing towards a recent trend foregrounding an active approach to data readiness [7,9,10].…”
Section: Introduction
Citation type: mentioning
confidence: 99%