After decades in which a great deal of effort was spent on the creation of resources, there are currently several initiatives worldwide that aim to create an interoperable, sustainable research infrastructure. The resources (data and tools) that researchers in the various disciplines employ constitute an integral part of such an infrastructure. Whether the infrastructure succeeds in supporting the research communities it intends to serve depends on a number of factors. One factor is that resources that are or could be relevant to the wider research community are made visible through this infrastructure and, to the greatest extent possible, accessible and usable. In practice, the durable availability of resources is often not properly regulated within research projects. CLARIN-NL is directed at creating an interoperable language resources infrastructure for the humanities in the Netherlands. The Data Curation Service (DCS) was established in order to salvage language resources in this field that are at risk of being lost. In the CLARIN context, a great deal of attention is given to standards, formats and intellectual property rights. Consequently, the DCS has a role as mediator in bringing researchers in the humanities and existing data centres closer together. This article consists of two parts: the first part provides the background to the work of the DCS, while the second part illustrates that work by describing the actual curation of a collection of language learner data.
Abstract: In this paper we discuss the compilation of a social media corpus with chats, tweets and SMS text messages as part of the SoNaR corpus, a 500-million-word reference corpus of written Dutch comprising many different text categories. Social media are increasingly becoming part of everyday life, which makes the need for social media corpora an urgent matter for research. Special attention was paid to the collection of metadata and to intellectual property rights (IPR). IPR was obtained both through licenses with platform owners and by consent of individual contributors. Participants were recruited by means of free publicity. The data will be available for research and commercial use.