Over the past years, the amount of offensive speech online has been growing steadily. Machine learning is widely applied to cope with it, but ML-based techniques require sufficiently large annotated datasets. In recent years, several datasets have been published, mainly for English. In this paper, we present a new dataset for Portuguese, a language that has received little attention so far. The dataset is composed of 5,668 tweets. For its annotation, we defined two different schemes, used by annotators with different levels of expertise. First, non-experts annotated the tweets with binary labels ('hate' vs. 'no-hate'). Then, expert annotators classified the tweets following a fine-grained hierarchical multi-label scheme with 81 hate speech categories in total. The inter-annotator agreement varied from category to category, which reflects the insight that some types of hate speech are more subtle than others and that their detection depends on personal perception. The hierarchical annotation scheme is the main contribution of the presented work, as it facilitates the identification of different types of hate speech and their intersections. To demonstrate the usefulness of our dataset, we carried out a baseline classification experiment with pre-trained word embeddings and an LSTM on the binary-labeled data, with a state-of-the-art outcome.
Research data management is rapidly becoming a regular concern for researchers, and institutions need to provide them with platforms to support data organization and preparation for publication. Some institutions have adopted institutional repositories as the basis for data deposit, whereas others are experimenting with richer environments for data description, in spite of the diversity of existing workflows. This paper is a synthetic overview of current platforms that can be used for data management purposes. Adopting a pragmatic view on data management, the paper focuses on solutions that can be adopted in the long tail of science, where investments in tools and manpower are modest. First, a broad set of data management platforms is presented, some of them designed for institutional repositories and digital libraries, to select a short list of the more promising ones for data management. These platforms are compared considering [...]

This paper is an extended version of a previously published comparative study. Please refer to the WCIST 2015 conference proceedings.
Research data management is acknowledged as an important concern for institutions, and several platforms to support data deposits have emerged. In this paper we start by overviewing the current practices in the data management workflow and identifying the stakeholders in this process. We then compare four recently proposed data repository platforms (DSpace, CKAN, Zenodo and Figshare), considering their architecture, support for metadata, API completeness, as well as their search mechanisms and community acceptance. To evaluate these features, we take into consideration the identified stakeholders' requirements. In the end, we argue that, depending on local requirements, different data repositories can meet some of the stakeholders' requirements. Nevertheless, there is still room for improvement, mainly regarding compatibility with the description of data from different research domains, to further improve data reuse.