Large language models [5,7] have demonstrated an emergent capability in answering knowledge-intensive questions. With recent progress on web-scale visual and language pre-training [2,6,38], do these models also understand how to answer visual information-seeking questions? To answer this question, we present INFOSEEK, a Visual Question Answering dataset that focuses on asking information-seeking questions, where the information cannot be answered by common sense knowledge. We perform a multi-stage human annotation to collect a natural distribution of high-quality visual information-seeking question-answer pairs. We also construct a large-scale, automatically collected dataset by combining existing visual entity recognition datasets with Wikidata, which provides over one million examples for model fine-tuning and validation. Based on INFOSEEK, we analyze various pre-trained Visual QA systems to gain insights into the characteristics of different pre-trained models. Our analysis shows that it is challenging for state-of-the-art multi-modal pre-trained models to answer visual information-seeking questions, but this capability is improved through fine-tuning on the automatically collected INFOSEEK dataset. We hope our analysis paves the way toward understanding and developing the next generation of multi-modal pre-training.
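
The abstract mentions combining visual entity recognition datasets with Wikidata to automatically build QA examples. The sketch below is a minimal, hypothetical illustration of that general idea, not the authors' actual pipeline: the image-to-entity mapping, the fact triples, and the question templates are all invented placeholders standing in for real visual entity recognition annotations and Wikidata properties.

```python
# Illustrative sketch (not the paper's pipeline): auto-generating
# information-seeking QA pairs by joining visual entity labels with
# Wikidata-style (entity, property, value) facts.
# All entities, facts, and templates below are hypothetical examples.

from typing import Dict, List, Tuple

# Hypothetical output of a visual entity recognition dataset:
# image id -> Wikidata entity recognized in the image.
IMAGE_TO_ENTITY: Dict[str, str] = {
    "img_001.jpg": "Q243",   # e.g., a well-known landmark
    "img_002.jpg": "Q9141",
}

# Hypothetical slice of Wikidata knowledge: entity -> {property: value}.
ENTITY_FACTS: Dict[str, Dict[str, str]] = {
    "Q243": {"height": "330 m", "inception": "1889"},
    "Q9141": {"height": "73 m", "inception": "1653"},
}

# Question templates keyed by property (hypothetical phrasings).
TEMPLATES: Dict[str, str] = {
    "height": "How tall is this landmark?",
    "inception": "In which year was this building completed?",
}


def generate_qa_pairs() -> List[Tuple[str, str, str]]:
    """Return (image, question, answer) triples by filling a template
    with a fact about the entity recognized in each image."""
    examples = []
    for image, entity in IMAGE_TO_ENTITY.items():
        for prop, value in ENTITY_FACTS.get(entity, {}).items():
            question = TEMPLATES.get(prop)
            if question is not None:
                examples.append((image, question, value))
    return examples


if __name__ == "__main__":
    for image, question, answer in generate_qa_pairs():
        print(f"{image}\tQ: {question}\tA: {answer}")
```

Because the question never names the entity directly, answering it requires both recognizing what is in the image and retrieving the corresponding fact, which is the kind of information-seeking behavior the dataset is designed to probe.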