Reinforcement learning (RL) agents are commonly evaluated via their expected value over a distribution of test scenarios. Unfortunately, this evaluation approach provides limited evidence for post-deployment generalization beyond the test distribution. In this paper, we address this limitation by extending the recent CHECKLIST testing methodology from natural language processing to planning-based RL. Specifically, we consider testing RL agents that make decisions via online tree search using a learned transition model and value function. The key idea is to improve the assessment of future performance via a CHECKLIST approach for exploring and assessing the agent's inferences during tree search. The approach provides the user with an interface and a general query-rule mechanism for identifying potential inference flaws and validating expected inference invariances. We present a user study in which knowledgeable AI researchers use the approach to evaluate an agent trained to play a complex real-time strategy game. The results show that the approach is effective in allowing users to identify previously unknown flaws in the agent's reasoning. In addition, our analysis provides insight into how AI experts use this type of testing approach, which may help improve future instantiations.
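To make the query-rule idea concrete, the following is a minimal sketch, not the interface described in the paper. It assumes a search tree whose nodes hold a predicted state and a value estimate; the names `Node`, `walk`, `run_rule`, and the state keys `enemy_units` and `own_units` are all hypothetical. A query selects the nodes of interest, and a check states the invariance those nodes are expected to satisfy.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class Node:
    """Hypothetical search-tree node: a predicted state and its value estimate."""
    state: dict
    value: float
    children: List["Node"] = field(default_factory=list)


def walk(node: Node) -> Iterator[Node]:
    """Yield every node in the tree in depth-first order."""
    yield node
    for child in node.children:
        yield from walk(child)


def run_rule(root: Node,
             query: Callable[[Node], bool],
             check: Callable[[Node], bool]) -> List[Node]:
    """Return the nodes that match the query but violate the expected check."""
    return [n for n in walk(root) if query(n) and not check(n)]


# Toy tree with made-up states and values (illustrative only).
root = Node({"enemy_units": 3, "own_units": 3}, 0.1, [
    Node({"enemy_units": 0, "own_units": 3}, 0.9),
    Node({"enemy_units": 0, "own_units": 2}, -0.4),  # winning state, negative value
    Node({"enemy_units": 3, "own_units": 1}, -0.7),
])

# Rule: states with no enemy units left should have a positive value estimate.
violations = run_rule(
    root,
    query=lambda n: n.state["enemy_units"] == 0 and n.state["own_units"] > 0,
    check=lambda n: n.value > 0,
)
```

Under these assumptions, each returned node is a candidate inference flaw for a human tester to inspect, rather than a confirmed bug.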
Fig. 1. With DendroMap, users can explore large-scale image datasets by overviewing the overall distributions and zooming into hierarchies of image groups at multiple levels of abstraction. In this example, we visualize images of the CIFAR-100 dataset by hierarchically clustering the image representations obtained from a ResNet50 image classification model. (B) DendroMap View displays these clusters of images organized as a hierarchical structure by adapting Treemaps. By clicking on a cluster, a user can interactively (C) Zoom into that image group, revealing subgroups that replace and fill the available space with animation (see the submitted video). Here, the user has clicked on a cluster of organism images, which reveals distinct subgroups of fish, insects, worms, fruits, and flowers. With (A) Sidebar View, the user can dynamically adjust the number of clusters to be displayed and inspect the class-level statistics.
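The hierarchy behind such a view can be illustrated with a small agglomerative clustering sketch. This is not DendroMap's implementation: the 2-D toy vectors stand in for ResNet50 representations, the centroid-distance merge criterion is an assumption, and `centroid` and `build_hierarchy` are hypothetical helper names. Zooming into a cluster corresponds to descending to its `children`.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dim)]


def build_hierarchy(vectors):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    centroids are closest, until a single root cluster remains.

    Each cluster is a dict with the indices of its items and its two
    child clusters (an empty list for a single-item leaf).
    """
    clusters = [{"items": [i], "children": []} for i in range(len(vectors))]
    while len(clusters) > 1:
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: math.dist(
                centroid([vectors[i] for i in pair[0]["items"]]),
                centroid([vectors[i] for i in pair[1]["items"]]),
            ),
        )
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append({"items": a["items"] + b["items"], "children": [a, b]})
    return clusters[0]


# Toy "image representations": two well-separated groups of two points each.
vectors = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
root = build_hierarchy(vectors)

# "Zooming" into the root reveals its two subgroups.
subgroups = sorted(sorted(child["items"]) for child in root["children"])
```

On real image features, a library routine for hierarchical clustering would be the more practical choice; the pairwise search here is quadratic per merge and is only meant to show the structure.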