In recent years, studies on question answering (QA) have successfully benefited from deep neural networks and have shown remarkable performance improvements on textQA [24,30], imageQA [2,3,19,31], and videoQA [8,11,32,34]. This paper considers movie story QA [15,18,21,26,29], which aims at a joint understanding of vision and language by answering questions about movie content and storyline after observing temporally-aligned video and subtitles. Movie story QA is challenging compared to VQA in the following two aspects: (1) pinpointing the temporal parts relevant to answering the question is difficult, as movies are typically longer than an hour, and (2) it involves both video and subtitles, and different questions require different modalities to infer the answer.