Datasheets for datasets

Gebru, Timnit; Morgenstern, Jamie; Vecchione, Briana; Vaughan, J.; Wallach, Hanna; Daumé, Hal; Crawford, Kate

doi:10.1145/3458723

Cited by 888 publications

(641 citation statements)

References 5 publications

“…Indeed, better provenance of the process by which data is generated will be critical in order to disentangle the source of dataset differences (for example, if clinical practices or environmental and social factors are giving rise to different healthcare measures and outcomes). Following guidelines developed for documenting datasets (72) and models (73) in the machine learning community, similar guidelines should be established for models in healthcare as well (10). An example is the proposal for reporting subgroup-level performances in MI-CLAIM checklist (74).…”

Section: Discussion (mentioning)

confidence: 99%

Generalizability Challenges of Mortality Risk Prediction Models: A Retrospective Analysis on a Multi-center Database

Singh, Mhasawade, Chunara. 2021. Preprint.

Importance: Modern predictive models require large amounts of data for training and evaluation which can result in building models that are specific to certain locations, populations in them and clinical practices. Yet, best practices and guidelines for clinical risk prediction models have not yet considered such challenges to generalizability.

Objectives: To investigate changes in measures of predictive discrimination, calibration, and algorithmic fairness when transferring models for predicting in-hospital mortality across ICUs in different populations. Also, to study the reasons for the lack of generalizability in these measures.

Design, Setting, and Participants: In this multi-center cross-sectional study, electronic health records from 179 hospitals across the US with 70,126 hospitalizations were analyzed. Time of data collection ranged from 2014 to 2015.

Main Outcomes and Measures: The main outcome is in-hospital mortality. Generalization gap, defined as difference between model performance metrics across hospitals, is computed for discrimination and calibration metrics, namely area under the receiver operating characteristic curve (AUC) and calibration slope. To assess model performance by race variable, we report differences in false negative rates across groups. Data were also analyzed using a causal discovery algorithm "Fast Causal Inference" (FCI) that infers paths of causal influence while identifying potential influences associated with unmeasured variables.

Results: In-hospital mortality rates differed in the range of 3.9%-9.3% (1st-3rd quartile) across hospitals. When transferring models across hospitals, AUC at the test hospital ranged from 0.777 to 0.832 (1st to 3rd quartile; median 0.801); calibration slope from 0.725 to 0.983 (1st to 3rd quartile; median 0.853); and disparity in false negative rates from 0.046 to 0.168 (1st to 3rd quartile; median 0.092). When transferring models across geographies, AUC ranged from 0.795 to 0.813 (1st to 3rd quartile; median 0.804); calibration slope from 0.904 to 1.018 (1st to 3rd quartile; median 0.968); and disparity in false negative rates from 0.018 to 0.074 (1st to 3rd quartile; median 0.040). Distribution of all variable types (demography, vitals, and labs) differed significantly across hospitals and regions. Shifts in the race variable distribution and some clinical (vitals, labs and surgery) variables by hospital or region. Race variable also mediates differences in the relationship between clinical variables and mortality, by hospital/region.

Conclusions and Relevance: Group-specific metrics should be assessed during generalizability checks to identify potential harms to the groups. In order to develop methods to improve and guarantee performance of prediction models in new environments for groups and individuals, better understanding and provenance of health processes as well as data generating processes by sub-group are needed to identify and mitigate sources of variation.

“…In line with these observations, Gebru et al. [10] propose the use of datasheets for datasets. They suggest that every dataset should be accompanied by a datasheet that documents its motivation, composition, collection process, recommended uses, and other important aspects, with the ultimate goal of increasing transparency and accountability within the community, mitigating unwanted biases in ML systems, and encouraging reproducibility of ML experiments.…”

Section: Challenges and Opportunities For Research Parasites In The Building Of Fair ML Systems (mentioning)

confidence: 95%

On the relationship between research parasites and fairness in machine learning: challenges and opportunities

et al. 2021

Machine learning systems influence our daily lives in many different ways. Hence, it is crucial to ensure that the decisions and recommendations made by these systems are fair, equitable, and free of unintended biases. Over the past few years, the field of fairness in machine learning has grown rapidly, investigating how, when, and why these models capture, and even potentiate, biases that are deeply rooted not only in the training data but also in our society. In this Commentary, we discuss challenges and opportunities for rigorous posterior analyses of publicly available data to build fair and equitable machine learning systems, focusing on the importance of training data, model construction, and diversity in the team of developers. The thoughts presented here have grown out of the work we did, which resulted in our winning the annual Research Parasite Award that GigaScience sponsors.

“…When creating a new dataset or challenge, it is advisable to document the dataset with its characteristics, and thus possible model limitations. Possibilities include data sheets [Gebru et al, 2018], which describe the data collection procedure, and model cards [Mitchell et al, 2019], which describe the choices made to train a model (including the data).…”

Section: Let Us Build Awareness Of Data Limitations (mentioning)

confidence: 99%

How I failed machine learning in medical imaging -- shortcomings and recommendations

Varoquaux, Cheplygina. 2021. Preprint.

Medical imaging is an important research field with many opportunities for improving patients' health. However, there are a number of challenges that are slowing down the progress of the field as a whole, such as optimizing for publication. In this paper we reviewed several problems related to choosing datasets, methods, evaluation metrics, and publication strategies. With a review of literature and our own analysis, we show that at every step, potential biases can creep in. On a positive note, we also see that initiatives to counteract these problems are already being started. Finally we provide a broad range of recommendations on how to further address these problems in the future. For reproducibility, data and code for our analyses are available on https://github.com/GaelVaroquaux/ml_med_imaging_failures.

Abstract: Documentation to facilitate communication between dataset creators and consumers.
