Carrying out research tasks on data collections is hampered, or even made impossible, by data quality issues of varying type and severity, such as incompleteness or inconsistency. We identify research tasks carried out by professional users of data collections that are hampered by inherent quality issues. We investigate what types of issues exist and how they influence these research tasks. To measure the quality perceived by professional users, we develop a quality metric. This allows us to measure the suitability of the data quality for a chosen user task. For a chosen task, we study how the data quality can be improved using crowdsourcing. We validate our quality metric by investigating whether professionals perform better on the chosen research task.
* Third year PhD student at CWI, supervised by Jacco van Ossenbruggen and Lynda Hardman.
MOTIVATION
Digitization initiatives in numerous libraries and archives and (linked) open data projects lead to a growing amount of digital information that can be used for research. While some disciplines within the humanities, such as literary studies [3], have already adopted research questions and practices that make use of digital data, other disciplines are still at an earlier stage of this process. This paradigm shift has caused researchers to reflect on the changes that are required in their approaches [1] and on how the new practices can extend the current research landscape.
The data custodians, on the other hand, put effort into making more content available in a way that lets users easily access and navigate it. The evaluation of digital archives and libraries needs to deal with a variety of aspects: data quality with respect to completeness, accuracy and consistency [6], usability of the interfaces, and biases caused by selective digitization and collection policies [10]. On top of this, the specific requirements that research tasks place on data enrichment and presentation have to be taken into account, as different tasks may, for example, weigh precision and recall differently [11]. For some tasks, objectively measurable aspects are crucial, while for other tasks the subjective perspective of users is more important [7].
To our knowledge, no research has so far evaluated how well the data of digital archives supports specific research tasks of humanities researchers.
Our research will therefore focus on the evaluation of data fitness for specific research tasks and how it can be improved.
To make sure that the data in libraries and archives meets the requirements of researchers, improvements would ideally be made by experts, such as archivists and librarians. Their expertise, however, is ...