“…The majority of work on interpretability so far has focused on (i), providing post-hoc explanations for a given prediction model. These include pixel attribution methods [Simonyan et al., 2014, Bach et al., 2015, Selvaraju et al., 2017], counterfactual explanations [Chang et al., 2019, Antoran et al., 2021], explanations based on pre-defined concepts [Kazhdan et al., 2020, Yeh et al., 2020], and explanations built on recently developed StyleGANs [Wu et al., 2021, Lang et al., 2021]. Post-hoc methods have a number of shortcomings given our desired objectives: First, it is unclear whether post-hoc explanations indeed reflect the black-box model's true "reasoning" [Rudin, 2018].…”