2018
DOI: 10.1162/tacl_a_00041

Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science

Abstract: In this paper, we propose data statements as a design solution and professional practice for natural language processing technologists, in both research and development. Through the adoption and widespread use of data statements, the field can begin to address critical scientific and ethical issues that result from the use of data from certain populations in the development of technology for other populations. We present a form that data statements can take and explore the implications of adopting them as part…

Cited by 556 publications (429 citation statements)
References 38 publications
“…Finally, the fact sheet discusses the validation process used to evaluate and prove the effectiveness of an explainability approach by describing a user study or synthetic approach that was carried out. Since the Fact Sheet can evolve over time, we encourage their creators to systematically version them [2], thereby making their recipients aware of any updates. Furthermore, indicating whether the whole Fact Sheet or some of its parts are with respect to a (theoretical) algorithmic approach, an actual implementation or a mixture of the two will benefit its clarity.…”
Section: Explainability Fact Sheets Dimensions (mentioning)
confidence: 99%
“…Another approach towards clarifying explainability properties in ML is self-reporting and certification. Approaches such as "data statements" [2], "data sheets for data sets" [11] and "nutrition labels for data sets" [16] can help to characterise a data set in a coherent way. Kelley et al [19] argued for a similar concept ("nutrition labels for privacy") to assess privacy of systems that handle personal (and sensitive) information.…”
Section: Related Work (mentioning)
confidence: 99%
“…However, no multilingual corpora were found, outlining a potential opportunity in developing resources for the area. Notice that the focus of the area on the English language (particularly, based on prose using in journalistic or encyclopedic texts) and its particular characteristics, although natural from an engineering point of view due to the availability of resources, may induce important bias in the area [70]. The use of multilingual resources, as proposed in this work, may come as a solution to this problem, leading to more robust and linguistically supported methods and applications.…”
Section: Answer To RQ4: What Are the Available Multilingual Open IE D… (mentioning)
confidence: 99%
“…Gebru et al proposed Datasheets for Datasets [2], a standardized reporting schema for data sets in machine learning, including criteria such as the original motivation for data collection, as well as collection procedure, summary of content, and privacy considerations. Similarly, Bender and Friedman proposed Data Statements [7] specifically tailored toward data sets for natural language processing, adding for instance speaker characteristics. Both Datasheets and Data Statements are manually constructed (whereas we envision generating ML consumer labels semi-automatically).…”
Section: Related Work (mentioning)
confidence: 99%