Today, computer vision systems are tested by their accuracy in detecting and localizing instances of objects. As an alternative, and motivated by the ability of humans to provide far richer descriptions and even tell a story about an image, we construct a "visual Turing test": an operator-assisted device that produces a stochastic sequence of binary questions from a given test image. The query engine proposes a question; the operator either provides the correct answer or rejects the question as ambiguous; the engine proposes the next question ("just-in-time truthing"). The test is then administered to the computer-vision system, one question at a time. After the system's answer is recorded, the system is provided the correct answer and the next question. Parsing is trivial and deterministic; the system being tested requires no natural language processing. The query engine employs statistical constraints, learned from a training set, to produce questions with essentially unpredictable answers: the answer to a question, given the history of questions and their correct answers, is nearly equally likely to be positive or negative. In this sense, the test is only about vision. The system is designed to produce streams of questions that follow natural story lines, from the instantiation of a unique object, through an exploration of its properties, and on to its relationships with other uniquely instantiated objects.

Going back at least to the mid-20th century there has been an active debate about the state of progress in artificial intelligence and how to measure it. Alan Turing (1) proposed that the ultimate test of whether a machine could "think," or think at least as well as a person, was for a human judge to be unable to tell which was which based on natural language conversations in an appropriately cloaked scenario.
In a much-discussed variation (sometimes called the "standard interpretation"), the objective is to measure how well a computer can imitate a human (2) in some circumscribed task normally associated with intelligent behavior, although the practical utility of "imitation" as a criterion for performance has also been questioned (3). In fact, the overwhelming focus of the modern artificial intelligence (AI) community has been to assess machine performance more directly by dedicated tests for specific tasks rather than debating about general "thinking" or Turing-like competitions between people and machines.

In this paper we implement a new, query-based test for computer vision, one of the most vibrant areas of modern AI research. Throughout this paper we use "computer vision" more or less synonymously with semantic image interpretation ("images to words"). Of course, computer vision encompasses a great many other activities; it includes the theory and practice of image formation ("sensors to images"), image processing ("images to images"), mathematical representations, video processing, metric scene reconstruction, and so forth. In fact, it may not be possible to interpret scenes at a semantic leve...