Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Räuker, Tilman; Ho, Anson T. Y.; Casper, Stephen T.; Hadfield-Menell, Dylan

doi:10.48550/arxiv.2207.13243

Cited by 3 publications

(1 citation statement)

References 172 publications

“…Mechanistic Interpretability. Mechanistic interpretation explains how LMs work by reverse engineering, i.e., reconstructing LMs with different components (Räuker et al., 2022). A recent line of work provides interpretation focusing on the LM's weights and intermediate representations (Olah et al., 2017, 2018, 2020).…”

Section: Related Work (mentioning)

confidence: 99%

Towards a Mechanistic Interpretation of Multi-Step Reasoning Capabilities of Language Models

Hou, Li, Fei et al. 2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Recent work has shown that language models (LMs) have strong multi-step (i.e., procedural) reasoning capabilities. However, it is unclear whether LMs perform these tasks by cheating with answers memorized from pretraining corpus, or, via a multi-step reasoning mechanism. In this paper, we try to answer this question by exploring a mechanistic interpretation of LMs for multi-step reasoning tasks. Concretely, we hypothesize that the LM implicitly embeds a reasoning tree resembling the correct reasoning process within it. We test this hypothesis by introducing a new probing approach (called MechanisticProbe) that recovers the reasoning tree from the model's attention patterns. We use our probe to analyze two LMs: GPT-2 on a synthetic task (k-th smallest element), and LLaMA on two simple language-based reasoning tasks (ProofWriter & AI2 Reasoning Challenge). We show that MechanisticProbe is able to detect the information of the reasoning tree from the model's attentions for most examples, suggesting that the LM indeed is going through a process of multi-step reasoning within its architecture in many cases.

Explaining AI through mechanistic interpretability

Kästner, Crook 2024

Euro Jnl Phil Sci

Recent work in explainable artificial intelligence (XAI) attempts to render opaque AI systems understandable through a divide-and-conquer strategy. However, this fails to illuminate how trained AI systems work as a whole. Precisely this kind of functional understanding is needed, though, to satisfy important societal desiderata such as safety. To remedy this situation, we argue, AI researchers should seek mechanistic interpretability, viz. apply coordinated discovery strategies familiar from the life sciences to uncover the functional organisation of complex AI systems. Additionally, theorists should accommodate for the unique costs and benefits of such strategies in their portrayals of XAI research.

From Discovery to Adoption: Understanding the ML Practitioners’ Interpretability Journey

Ashtari, Mullins, Qian et al. 2023

Proceedings of the 2023 ACM Designing Interactive Systems Conference
