Purpose: Budgeting data curation tasks in research projects is difficult. In this paper, we investigate the time spent on data curation, specifically on cleaning and documenting quantitative data for data sharing, and develop recommendations on cost factors in research data management.

Design/methodology/approach: We draw on a pilot study conducted at the GESIS Data Archive for the Social Sciences in Germany between December 2016 and September 2017. During this period, data curators at GESIS - Leibniz Institute for the Social Sciences recorded their working hours while cleaning and documenting data from ten quantitative survey studies. We analyse the recorded times and, in discussions with the data curators involved, identify and examine important cost factors in data curation, that is, aspects that increase the hours spent and factors that reduce the work required.

Findings: We identify two major drivers of time spent on data curation: the size of the data and the personal information contained in the data. Learning effects can occur when datasets are similar, that is, when they contain the same variables. Important interdependencies exist between individual tasks in data curation and in connection with certain data characteristics.

Originality/value: The different tasks of data curation, the time spent on them and the interdependencies between individual curation steps have so far not been analysed.
Comparative statistical analyses often require data harmonization, yet the social sciences lack clear operationalization frameworks that guide and homogenize variable coding decisions across disciplines. When faced with a need to harmonize variables, researchers often look for guidance from international studies that employ output harmonization, such as the Comparative Survey of Election Studies, which offer recoding structures for the same variable (e.g. marital status). More problematically, there are no agreed documentation standards or journal requirements for reporting variable harmonization to facilitate a transparent replication process. We propose a conceptual and data-driven digital solution that creates harmonization documentation standards for publication and scholarly citation: QuickCharmStats 1.1. It is free and open-source software that allows for the organizing, documenting and publishing of data harmonization projects. QuickCharmStats starts at the conceptual level, and its workflow ends with a variable recoding syntax. It is therefore flexible enough to reflect a variety of theoretical justifications for variable harmonization. Using the socio-demographic variable ‘marital status’, we demonstrate how the CharmStats workflow collates metadata while being guided by the scientific standards of transparency and replication. It encourages researchers to publish their harmonization work by providing those who complete the peer-review process with a permanent identifier. Those who contribute original data harmonization work to their discipline can now be credited through citations. Finally, we propose peer-review standards for harmonization documentation, describe a route to online publishing, and provide a referencing format for citing harmonization projects. Although CharmStats products are designed for social scientists, our adherence to the scientific method ensures our products can be used by researchers across the sciences.
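To make the idea of output harmonization concrete, the following is a minimal sketch of recoding a ‘marital status’ variable from two hypothetical source surveys into one shared target scheme. The category codes, labels and survey names are illustrative assumptions, not taken from QuickCharmStats or any actual study; a real harmonization project would document and justify each mapping.

```python
# Hypothetical output harmonization: map source-specific codes for
# 'marital status' onto one shared target coding scheme.
# All codes and labels below are invented for illustration.

TARGET = {1: "married", 2: "never married", 3: "divorced/separated", 4: "widowed"}

# Source-specific recode maps: raw source code -> target code.
# Survey A distinguishes divorced (3) from separated (4); both collapse to target 3.
SURVEY_A = {1: 1, 2: 2, 3: 3, 4: 3, 5: 4}
# Survey B uses letter codes.
SURVEY_B = {"M": 1, "S": 2, "D": 3, "W": 4}

def harmonize(values, recode_map):
    """Recode raw source values into the target scheme; None for unmapped codes."""
    return [recode_map.get(v) for v in values]

print(harmonize([1, 4, 5], SURVEY_A))        # -> [1, 3, 4]
print(harmonize(["M", "D", "X"], SURVEY_B))  # -> [1, 3, None]
```

The point of the documentation standard proposed above is that tables like `SURVEY_A` and `SURVEY_B`, together with the rationale for decisions such as collapsing ‘divorced’ and ‘separated’, are what a replicator needs to see published and citable.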