2023
DOI: 10.48550/arxiv.2303.02536
Preprint

Finding Alignments Between Interpretable Causal Variables and Distributed Neural Representations

Abstract: Causal abstraction is a promising theoretical framework for explainable artificial intelligence that defines when an interpretable high-level causal model is a faithful simplification of a low-level deep learning system. However, existing causal abstraction methods have two major limitations: they require a brute-force search over alignments between the high-level model and the low-level one, and they presuppose that variables in the high-level model will align with disjoint sets of neurons in the low-level one…
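The brute-force alignment search the abstract criticizes can be illustrated with a toy example. The sketch below is not the paper's implementation; all names (`high_level`, `low_encode`, `interchange_ok`, `search_alignment`) and the three-unit hidden state are hypothetical. It tests a candidate alignment between a high-level variable V = A AND B and a subset of low-level units via interchange interventions: patch the candidate units with values from a source run and check that the low-level output matches the high-level counterfactual.

```python
import itertools

def high_level(a, b):
    # high-level causal model: a single variable V = A AND B
    return int(a and b)

def low_encode(a, b):
    # toy "low-level" hidden state: unit 0 carries A, unit 1 carries B,
    # unit 2 is an inert filler unit
    return [a, b, 0]

def low_decode(h):
    return int(h[0] and h[1])

def interchange_ok(units):
    """True iff, for every (base, source) input pair, patching `units`
    from the source run into the base run makes the low-level output
    match the high-level counterfactual where V takes its source value."""
    inputs = list(itertools.product([0, 1], repeat=2))
    for base in inputs:
        for src in inputs:
            h, h_src = low_encode(*base), low_encode(*src)
            for u in units:
                h[u] = h_src[u]
            if low_decode(h) != high_level(*src):
                return False
    return True

def search_alignment():
    # the brute-force step: enumerate every candidate set of units
    sets = [list(c) for r in (1, 2, 3)
            for c in itertools.combinations(range(3), r)]
    return [u for u in sets if interchange_ok(u)]

print(search_alignment())  # → [[0, 1], [0, 1, 2]]
```

Only unit sets containing both informative units pass the interchange test, and finding them here requires enumerating all 2³ − 1 candidate sets; the cost of this enumeration is exactly what motivates a learned, gradient-based search.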

Cited by 3 publications (4 citation statements) · References 21 publications
“…The models are truly learning a symbol, rather than a weaker heuristic. 17 Another potential strategy is to examine the representations that emerge from particular learning inputs, and how these support behaviors. In humans, these learning inputs can only be manipulated in the context of short lab experiments, but with models the learning input can be changed radically in both its quantity and its contents.…”
Section: LLMs as Cognitive Models
confidence: 99%
“…(ii) a more general setting in which the human's concepts are constrained to be disentangled in blocks; and (iii) an unrestricted setting in which the human concepts can influence each other in arbitrary manners. In addition, we identify a previously ignored link between interpretability of representations and the notion of causal abstraction [34][35][36].…”
Section: Our Contributions
confidence: 99%
“…The presence of a consistency property between C H and C M is what defines a causal abstraction [34,35,60]; see [61] for an overview. Causal abstractions have been proposed to define (approximate) equivalence between causal graphs and have recently been employed in the context of explainable AI [36,62]. The existence of a causal abstraction ensures two systems are interventionally equivariant: interventions on one system can always be mapped (modulo approximations) to equivalent interventions in the other and lead to the same interventional distribution.…”
Section: Alignment and Causal Abstractions
confidence: 99%
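The quoted passage's notion of interventional equivalence, that interventions on one system map to interventions on the other and induce the same interventional distribution, can be sketched concretely. This is a minimal toy, not the cited papers' formalism; the models, the mapping `omega`, and all function names are assumed for illustration.

```python
import itertools
from collections import Counter

def high_out(a, b, do_v=None):
    # high-level model: output V = A AND B, unless do(V = v) overrides it
    return do_v if do_v is not None else int(a and b)

def low_out(a, b, do_h=None):
    # low-level system: two hidden units, optionally clamped by an intervention
    h = [a, b]
    if do_h is not None:
        for i, v in do_h.items():
            h[i] = v
    return int(h[0] and h[1])

def omega(v):
    # assumed mapping from the high-level intervention do(V = v)
    # to an intervention on the aligned low-level units
    return {0: v, 1: v}

def dist(f, **kw):
    # output distribution over uniformly sampled inputs
    return Counter(f(a, b, **kw) for a, b in itertools.product([0, 1], repeat=2))

# observationally the systems agree, and every high-level intervention,
# once mapped through omega, yields the same interventional distribution
assert dist(high_out) == dist(low_out)
for v in (0, 1):
    assert dist(high_out, do_v=v) == dist(low_out, do_h=omega(v))
```

Here `omega` plays the role of the intervention translation: because it exists and the distributions match for every intervention, the two toy systems are interventionally equivalent in the sense the passage describes.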