Towards falsifiable interpretability research

Leavitt, Matthew L.; Morcos, Ari

doi:10.48550/arxiv.2010.12016

Cited by 9 publications

(21 citation statements)

References 63 publications


“…This lack of consensus is worrying, as measures are often designed according to different and incompatible intuitive desiderata, such as whether finding a one-to-one assignment, or finding few-to-one mappings, between neurons is more appropriate [17]. As a community, we need well-chosen formal criteria for evaluating metrics to avoid over-reliance on intuition and the pitfalls of too many researcher degrees of freedom [14].…”

Section: Introduction (mentioning)

confidence: 99%

Grounding Representation Similarity with Statistical Testing

Ding, Denain, Steinhardt. 2021. Preprint.


To understand neural network behavior, recent works quantitatively compare different networks' learned representations using canonical correlation analysis (CCA), centered kernel alignment (CKA), and other dissimilarity measures. Unfortunately, these widely used measures often disagree on fundamental observations, such as whether deep networks differing only in random initialization learn similar representations. These disagreements raise the question: which, if any, of these dissimilarity measures should we believe? We provide a framework to ground this question through a concrete test: measures should have sensitivity to changes that affect functional behavior, and specificity against changes that do not. We quantify this through a variety of functional behaviors including probing accuracy and robustness to distribution shift, and examine changes such as varying random initialization and deleting principal components. We find that current metrics exhibit different weaknesses, note that a classical baseline performs surprisingly well, and highlight settings where all metrics appear to fail, thus providing a challenge set for further improvement.


“…Taken together, our empirical results show that the widely used visualization method by Olah et al. (2017) is more limited in its ability to convey causal understanding of CNN activations than previously assumed. This reinforces the importance of testing falsifiable hypotheses in the field of interpretable artificial intelligence (Leavitt & Morcos, 2020). Feature visualizations certainly have an important place within the fields of interpretability and explainability, and their importance is likely to grow further with increased societal applications of machine learning.…”

Section: Discussion (mentioning)

confidence: 63%

“…Explanation methods such as feature visualizations have been criticized as intuition-driven (Leavitt & Morcos, 2020), and it is unclear whether they allow humans to gain a precise understanding of which image features "cause" high activation in a unit. Here, we propose an objective psychophysical task to quantify how well these synthetic images support causal understanding of CNN units.…”

Section: Discussion (mentioning)

confidence: 99%

“…Feature visualizations are a widely used method to understand the learned representations and decision-making of CNNs (Mahendran & Vedaldi, 2015; Nguyen et al, 2015; Mordvintsev et al, 2015; Nguyen et al, 2016; Tsipras et al, 2019; Engstrom et al, 2019; Olah et al, 2017; Nguyen et al, 2019). However, others question whether this synthetic visualization technique, first introduced by Erhan et al (2009), is too intuition-driven (Leavitt & Morcos, 2020), and how representative the appealing visualizations in publications are (Kriegeskorte, 2015). Further, as already mentioned above, the engineering of the loss function may influence their faithfulness (Nguyen et al, 2017).…”

Section: Related Work (mentioning)

confidence: 99%


How Well do Feature Visualizations Support Causal Understanding of CNN Activations?

Zimmermann, Borowski, Geirhos et al. 2021. Preprint.


One widely used approach towards understanding the inner workings of deep convolutional neural networks is to visualize unit responses via activation maximization. Feature visualizations via activation maximization are thought to provide humans with precise information about the image features that cause a unit to be activated. If this is indeed true, these synthetic images should enable humans to predict the effect of an intervention, such as whether occluding a certain patch of the image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to predict which of two square occlusions causes a larger change to a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that on average, the extremely activating feature visualizations by Olah et al. (2017) indeed help humans on this task (67 ± 4 % accuracy; baseline performance without any visualizations is 60 ± 3 %). However, they do not provide any significant advantage over other visualizations (such as e.g. dataset samples), which yield similar performance (66 ± 3 % to 67 ± 3 % accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that feature visualizations provide humans with better "causal understanding" than simple alternative visualizations.


“…Development of methods for relating complex later-layer factors to well-understood early layer factors is an important priority for further interpretability work in complex domains. Finally, we note that although these interpretations of the above factors bear out in the majority of the randomly selected positions shown in the online database of factors, an interpretation can only be considered definitive once it has been quantitatively validated [91], ideally by intervening on the input.…”

Section: Results (mentioning)

confidence: 98%

Acquisition of Chess Knowledge in AlphaZero

McGrath, Kapishnikov, Tomašev et al. 2021. Preprint.


What is learned by sophisticated neural network agents such as AlphaZero? This question is of both scientific and practical interest. If the representations of strong neural networks bear no resemblance to human concepts, our ability to understand faithful explanations of their decisions will be restricted, ultimately limiting what we can achieve with neural network interpretability. In this work we provide evidence that human knowledge is acquired by the AlphaZero neural network as it trains on the game of chess. By probing for a broad range of human chess concepts we show when and where these concepts are represented in the AlphaZero network. We also provide a behavioural analysis focusing on opening play, including qualitative analysis from chess Grandmaster Vladimir Kramnik. Finally, we carry out a preliminary investigation looking at the low-level details of AlphaZero's representations, and make the resulting behavioural and representational analyses available online.
