MarioQA: Answering Questions by Watching Gameplay Videos

Mun, Jonghwan; Seo, Paul Hongsuck; Jung, Ilchae; Han, Bohyung

doi:10.1109/iccv.2017.312

Cited by 82 publications

(51 citation statements)

References 22 publications

“…Comparing with images, temporal domain is unique to videos. A temporal attention mechanism is leveraged to selectively attend to one or more periods of a video in [16,24,35]. Besides temporal attention mecha-…”

Section: Related Work (mentioning)

confidence: 99%

Motion-Appearance Co-memory Networks for Video Question Answering

Gao, Chen, et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Figure 1. Answering questions in videos involves both motion and appearance analysis, and usually requires multiple cycles of reasoning, especially for transitive questions, e.g. "What does the woman do after look uncertain?", we need to first localize when the woman looks uncertain, which requires motion evidence for looking uncertain and appearance evidence for the woman; and then focus on what the woman does (smile).

Abstract: Video Question Answering (QA) is an important task in understanding video temporal structure. We observe that there are three unique attributes of video QA compared with image QA: (1) it deals with long sequences of images containing richer information not only in quantity but also in variety; (2) motion and appearance information are usually correlated with each other and able to provide useful attention cues to the other; (3) different questions require different number of frames to infer the answer. Based on these observations, we propose a motion-appearance co-memory network for video QA. Our networks are built on concepts from Dynamic Memory Network (DMN) and introduces new mechanisms for video QA. Specifically, there are three salient aspects: (1) a co-memory attention mechanism that utilizes cues from both motion and appearance to generate attention; (2) a temporal conv-deconv network to generate multi-level contextual facts; (3) a dynamic fact ensemble method to construct temporal representation dynamically for different questions. We evaluate our method on TGIF-QA dataset, and the results outperform state-of-the-art significantly on all four tasks of TGIF-QA.

“…More recently, modular networks [4,20,25] that construct an explicit representation of the reasoning process by exploiting the compositional nature of language have been proposed. Similar architectures have also been applied to the video domain with extensions such as spatiotemporal attention [23,49]. Our proposed approach to question answering allows the agent to interact with its environment and is thus fundamentally different to past QA approaches.…”

Section: Related Work (mentioning)

confidence: 99%

IQA: Visual Question Answering in Interactive Environments

Gordon, Kembhavi, Rastegari, et al. 2018

2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

We introduce Interactive Question Answering (IQA), the task of answering questions that require an autonomous agent to interact with a dynamic visual environment. IQA presents the agent with a scene and a question, like: "Are there any apples in the fridge?" The agent must navigate around the scene, acquire visual understanding of scene elements, interact with objects (e.g. open refrigerators) and plan for a series of actions conditioned on the question. Popular reinforcement learning approaches with a single controller perform poorly on IQA owing to the large and diverse state space. We propose the Hierarchical Interactive Memory Network (HIMN), consisting of a factorized set of controllers, allowing the system to operate at multiple levels of temporal abstraction. To evaluate HIMN, we introduce IQUAD V1, a new dataset built upon AI2-THOR [35], a simulated photo-realistic environment of configurable indoor scenes with interactive objects. IQUAD V1 has 75,000 questions, each paired with a unique scene configuration. Our experiments show that our proposed model outperforms popular single controller based methods on IQUAD V1. For sample questions and results, please view our video: https://youtu.be/pXd3C-1jr98.

“…In the video domain, the TGIF-QA (Jang et al, 2017) and Mario-QA (Mun et al, 2016) datasets provide opportunities to study temporal reasoning for the task of VQA. The TGIF-QA dataset considers three types of temporal questions: before/after questions, repetition count, and determining a repeating action.…”

Section: Related Work (mentioning)

confidence: 99%

Localizing Moments in Video with Temporal Language

Hendricks, Wang, Shechtman, et al. 2018

Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Localizing moments in a longer video via natural language queries is a new, challenging task at the intersection of language and video understanding. Though moment localization with natural language is similar to other language and vision tasks like natural language object retrieval in images, moment localization offers an interesting opportunity to model temporal dependencies and reasoning in text. We propose a new model that explicitly reasons about different temporal segments in a video, and shows that temporal context is important for localizing phrases which include temporal language. To benchmark whether our model, and other recent video localization models, can effectively reason about temporal language, we collect the novel TEMPOral reasoning in video and language (TEMPO) dataset. Our dataset consists of two parts: a dataset with real videos and template sentences (TEMPO - Template Language) which allows for controlled studies on temporal language, and a human language dataset which consists of temporal sentences annotated by humans (TEMPO - Human Language).
