2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2017.475

GuessWhat?! Visual Object Discovery through Multi-modal Dialogue

Abstract: We introduce GuessWhat?!, a two-player guessing game as a testbed for research on the interplay of computer vision and dialogue systems. The goal of the game is to locate an unknown object in a rich image scene by asking a sequence of questions. Higher-level image understanding, like spatial reasoning and language grounding, is required to solve the proposed task. Our key contribution is the collection of a large-scale dataset consisting of 150K human-played games with a total of 800K visual question-answer pa…

Cited by 321 publications (431 citation statements)
References 41 publications
“…Early works use a reconstruction-based approach [40] or integrate global context with the spatial configurations [19]. Recent approaches [7,34,49] learn directly in the multi-modal feature space (https://github.com/TheShadow29/zsgnet-pytorch) and use attention mechanisms [9,45], which have also been extended to phrase grounding in dialogue systems [8,52]. A few approaches also look at unsupervised learning using variational context [50] and semi-supervised learning via gating mechanisms [6].…”
Section: Related Work
confidence: 99%
“…• In a fully-observable context (De Vries et al. 2017), it is usually given that all information about the context is shared among the agents. This makes common grounding easier because information about the context is already in their common ground, and there is little chance of misunderstanding.…”
Section: A's View / B's View
confidence: 99%
“…In this approach, dialogue state and utterance generation are learned directly from large raw corpora with few prior constraints, so they are more suitable for complex common grounding where flexibility is a requirement. However, few existing tasks focus on the difficulty of common grounding, and most are based on either a fully-observable or a categorical context (De Vries et al. 2017; Bordes and Weston 2016; Lewis et al. 2017) where difficult common grounding is not required. The dataset closest to our setting is the MutualFriends dataset (He et al. 2017), which is based on the task of finding a mutual friend from private lists of friends.…”
Section: Related Work
confidence: 99%
“…GuessWhat?! [5] is a collaborative two-player visually grounded object-discovery game. The game begins by presenting an image I of a rich visual scene containing M objects C = {c_m}_{m=1}^{M} to both players, the questioner and the answerer.…”
Section: Related Work
confidence: 99%
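The excerpt above describes the game setup in notation only. As a rough illustration, the sketch below shows one way a single game instance could be represented in Python; the class and field names (SceneObject, GuessWhatGame, bbox, target_index) are illustrative assumptions, not the schema of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SceneObject:
    """One of the M candidate objects c_m in the image: a category label and a bounding box."""
    category: str
    bbox: Tuple[float, float, float, float]  # (x, y, width, height), assumed convention

@dataclass
class GuessWhatGame:
    """A single GuessWhat?! game: an image I, its M candidate objects C = {c_1, ..., c_M},
    the sequence of yes/no question-answer rounds, and the answerer's secret target object."""
    image_path: str
    objects: List[SceneObject]
    target_index: int  # index of the secret object, known only to the answerer
    dialogue: List[Tuple[str, str]] = field(default_factory=list)  # (question, answer) pairs

    def record_round(self, question: str, answer: str) -> None:
        """Append one question/answer round; answers are restricted to yes / no / n/a."""
        assert answer in {"yes", "no", "n/a"}
        self.dialogue.append((question, answer))

    def is_won(self, guessed_index: int) -> bool:
        """The questioner wins if its final guess matches the hidden target."""
        return guessed_index == self.target_index
```

Keeping the target index separate from the dialogue mirrors the asymmetry of the game: only the answerer sees it, while the questioner must infer it from the accumulated answers.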
“…Research on goal-oriented visual dialogue [1,5] has recently attracted much attention. Unlike conventional VQA [8], where the robot answerer must answer any question a human raises about an input image, even if the question itself is ambiguous or indefinite, goal-oriented visual dialogue extends the question-answering interaction to multiple rounds, turning the robot also into a questioner that can retrieve more information from the human.…”
Section: Introduction
confidence: 99%
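To make the contrast with single-shot VQA concrete, here is a minimal sketch of the multi-round questioner/answerer loop that excerpt describes. The callables ask, answer, and guess stand in for learned models (or human players); their names and signatures are placeholders for illustration, not an interface from the cited papers.

```python
from typing import Callable, List, Tuple

DialogueHistory = List[Tuple[str, str]]  # (question, answer) pairs accumulated so far

def run_goal_oriented_dialogue(
    ask: Callable[[DialogueHistory], str],     # questioner: history -> next question
    answer: Callable[[str], str],              # answerer/oracle: question -> "yes" | "no" | "n/a"
    guess: Callable[[DialogueHistory], int],   # guesser: history -> index of the chosen object
    max_rounds: int = 5,
) -> int:
    """Run several question-answer rounds, then commit to a final guess.

    Unlike single-shot VQA, the questioner keeps querying the answerer,
    accumulating evidence in the dialogue history before deciding.
    """
    history: DialogueHistory = []
    for _ in range(max_rounds):
        question = ask(history)
        history.append((question, answer(question)))
    return guess(history)
```

For example, a trivial questioner that always asks "Is it a person?" and a guesser that always picks object 0 would plug into this loop unchanged; in the actual task these roles are played by learned vision-and-language models.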