Causal Abstraction for Explanations of AI. Geiger et al. (2023) argue that causal abstraction is a generic theoretical framework for providing faithful (Lyu et al., 2022) and interpretable (Lipton, 2018) explanations of AI models, and show that LIME (Ribeiro et al., 2016), causal effect estimation (Abraham et al., 2022; Feder et al., 2021), causal mediation analysis (Vig et al., 2020; Csordás et al., 2021; De Cao et al., 2021), iterative nullspace projection (Ravfogel et al., 2020; Elazar et al., 2020), and circuit-based explanations (Olah et al., 2020; Olsson et al., 2022; Wang et al., 2022) can all be seen as special cases of causal abstraction analysis. The circuits research program also posits that a linear combination of neural activations, which they term a 'feature', is the fundamental unit of analysis in neural networks.
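As a minimal sketch of the circuits-style notion of a feature (all names and values below are illustrative assumptions, not from any of the cited works): a feature can be modeled as a direction in activation space, read out by projecting activations onto that direction, and ablated by projecting onto its nullspace, in the spirit of nullspace-projection erasure.

```python
import numpy as np

# Illustrative sketch: treat a "feature" as a unit direction in activation
# space; its value for an activation vector is the dot product with that
# direction. All shapes and data here are hypothetical.

rng = np.random.default_rng(0)
activations = rng.normal(size=(4, 8))   # batch of 4 activation vectors, width 8

direction = rng.normal(size=8)
direction /= np.linalg.norm(direction)  # unit-norm feature direction

# Scalar feature reading per example: a linear combination of activations.
feature_values = activations @ direction

# Ablate the feature by projecting each activation onto the direction's
# nullspace (subtract the component along the direction).
ablated = activations - np.outer(feature_values, direction)

# After ablation, the feature reads (numerically) zero on every example.
print(np.allclose(ablated @ direction, 0.0))
```

The projection-removal step is the single-direction analogue of the iterative procedure used in nullspace-projection methods, which repeat it for newly learned directions.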