Arthur Conmy scite author profile

Arthur Conmy

3 Publications

23 Citation Statements Received

5 Citation Statements Given

Affiliations

University of Cambridge

Publications

Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

Wang¹, Variengien², Conmy³ et al. 2022

Preprint

Research in mechanistic interpretability seeks to explain behaviors of machine learning (ML) models in terms of their internal components. However, most previous work either focuses on simple behaviors in small models or describes complicated behaviors in larger models with broad strokes. In this work, we bridge this gap by presenting an explanation for how GPT-2 small performs a natural language task called indirect object identification (IOI). Our explanation encompasses 26 attention heads grouped into 7 main classes, which we discovered using a combination of interpretability approaches relying on causal interventions. To our knowledge, this investigation is the largest end-to-end attempt at reverse-engineering a natural behavior "in the wild" in a language model. We evaluate the reliability of our explanation using three quantitative criteria: faithfulness, completeness and minimality. Though these criteria support our explanation, they also point to remaining gaps in our understanding. Our work provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale our understanding to both larger models and more complex tasks. Code for all experiments is available at https://github.com/redwoodresearch/Easy-Transformer.

StyleGAN-Induced Data-Driven Regularization for Inverse Problems

Conmy, Mukherjee, Schönlieb 2022

StyleGAN-induced data-driven regularization for inverse problems

Conmy¹, Mukherjee², Schönlieb³ 2021

Preprint
