As machine learning black boxes are increasingly being deployed in domains such as healthcare and criminal justice, there is growing emphasis on building tools and techniques for explaining these black boxes in an interpretable manner. Such explanations are being leveraged by domain experts to diagnose systematic errors and underlying biases of black boxes. In this paper, we demonstrate that post hoc explanation techniques that rely on input perturbations, such as LIME and SHAP, are not reliable. Specifically, we propose a novel scaffolding technique that effectively hides the biases of any given classifier by allowing an adversarial entity to craft an arbitrary desired explanation. Our approach can be used to scaffold any biased classifier in such a way that its predictions on the input data distribution remain biased, but the post hoc explanations of the scaffolded classifier look innocuous. Using extensive evaluation with multiple real-world datasets (including COMPAS), we demonstrate how extremely biased (racist) classifiers crafted by our framework can easily fool popular explanation techniques such as LIME and SHAP into generating innocuous explanations that do not reflect the underlying biases.
CCS Concepts: • Computing methodologies → Machine learning; Supervised learning by classification; • Human-centered computing → Interactive systems and tools.
Keywords: black box explanations, model interpretability, bias detection, adversarial attacks
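The scaffolding idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `make_scaffold`, `biased`, `innocuous`, and `looks_real` are hypothetical, and the detector here is a hand-written rule (real rows are assumed to be exactly binary), standing in for whatever mechanism separates genuine inputs from the perturbed samples that LIME- or SHAP-style explainers generate.

```python
from typing import Callable, Sequence

Row = Sequence[float]
Classifier = Callable[[Row], int]


def make_scaffold(
    biased: Classifier,
    innocuous: Classifier,
    looks_real: Callable[[Row], bool],
) -> Classifier:
    """Wrap two classifiers: biased on real-looking inputs, innocuous elsewhere."""

    def scaffolded(x: Row) -> int:
        # Inputs from the data distribution get the biased prediction;
        # perturbed (off-distribution) inputs get the innocuous one.
        return biased(x) if looks_real(x) else innocuous(x)

    return scaffolded


# Toy setup (all hypothetical): feature 0 is a sensitive attribute,
# feature 1 is an unrelated feature.
def biased(x: Row) -> int:
    return int(x[0] >= 0.5)  # decision depends only on the sensitive attribute


def innocuous(x: Row) -> int:
    return int(x[1] >= 0.5)  # decision depends only on the unrelated feature


def looks_real(x: Row) -> bool:
    # Assumption for this toy: real rows are exactly binary, while
    # perturbation-based explainers produce fractional values.
    return all(v in (0.0, 1.0) for v in x)


scaffolded = make_scaffold(biased, innocuous, looks_real)
```

In this toy, an explainer that probes `scaffolded` only through fractional perturbations would see behaviour driven by feature 1, while every real (binary) row is still classified by feature 0.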
Social media platforms are increasingly deploying complex interventions to help users detect false news. Labeling false news using techniques that combine crowd-sourcing with artificial intelligence (AI) offers a promising way to inform users about potentially low-quality information without censoring content, but can also be hard for users to understand. In this study, we examine how users' sharing intentions respond to information they are given about a hypothetical human-AI hybrid system. We ask i) whether these warnings increase discernment in social media sharing intentions and ii) whether explaining how the labeling system works can boost the effectiveness of the warnings. To do so, we conduct a study (N=1473 Americans) in which participants indicated their likelihood of sharing content. Participants were randomly assigned to a control, a treatment where false content was labeled, or a treatment where the warning labels came with an explanation of how they were generated. We find clear evidence that both treatments increase sharing discernment, and directional evidence that explanations increase the warnings' effectiveness. Interestingly, we do not find that the explanations increase self-reported trust in the warning labels, although we do find some evidence that participants found the warnings with the explanations to be more informative. Together, these results have important implications for designing and deploying transparent misinformation warning labels, and AI-mediated systems more broadly.
Counterfactual explanations are emerging as an attractive option for providing recourse to individuals adversely impacted by algorithmic decisions. As they are deployed in critical applications (e.g., law enforcement, financial lending), it becomes important to ensure that we clearly understand the vulnerabilities of these methods and find ways to address them. However, there is little understanding of the vulnerabilities and shortcomings of counterfactual explanations. In this work, we introduce the first framework that describes the vulnerabilities of counterfactual explanations and shows how they can be manipulated. More specifically, we show that counterfactual explanations may converge to drastically different counterfactuals under a small perturbation, indicating that they are not robust. Leveraging this insight, we introduce a novel objective to train seemingly fair models where counterfactual explanations find much lower cost recourse under a slight perturbation. We describe how these models can unfairly provide low-cost recourse for specific subgroups in the data while appearing fair to auditors. We perform experiments on loan and violent crime prediction data sets where certain subgroups achieve up to 20x lower cost recourse under the perturbation. These results raise concerns regarding the dependability of current counterfactual explanation techniques, which we hope will inspire investigations into robust counterfactual explanations.