GuessWhat?! Visual Object Discovery through Multi-modal Dialogue

Vries, Harm de; Strub, Florian; Chandar, Sarath; Pietquin, Olivier; Larochelle, Hugo; Courville, Aaron

doi:10.1109/cvpr.2017.475

Cited by 321 publications

(431 citation statements)

References 41 publications

Supporting

Mentioning

427

Contrasting

Unclassified


“…Early works use reconstruction based approach [40] or integrate global context with the spatial configurations [19]. Recent approaches [7,34,49] learn directly in the multi-modal feature space and use attention mechanisms [9,45] which have also been extended to phrase grounding in dialogue systems [8,52]. Few approaches also look at unsupervised learning using variational context [50] and semisupervised learning via gating mechanisms [6].…”

Section: Related Work (mentioning)

confidence: 99%

Zero-Shot Grounding of Objects From Natural Language Queries

Sadhu

Chen

Nevatia

2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

148


A phrase grounding system localizes a particular object in an image referred to by a natural language query. In previous work, the phrases were restricted to nouns that were encountered in training; we extend the task to Zero-Shot Grounding (ZSG), which can include novel, "unseen" nouns. Current phrase grounding systems use an explicit object detection network in a 2-stage framework where one stage generates sparse proposals and the other stage evaluates them. In the ZSG setting, generating appropriate proposals itself becomes an obstacle, as the proposal generator is trained on the entities common in the detection and grounding datasets. We propose a new single-stage model called ZSGNet, which combines the detector network and the grounding system and predicts classification scores and regression parameters. Evaluation of a ZSG system brings additional subtleties due to the influence of the relationship between the query and learned categories; we define four distinct conditions that incorporate different levels of difficulty. We also introduce new datasets, sub-sampled from Flickr30k Entities and Visual Genome, that enable evaluations for the four conditions. Our experiments show that ZSGNet achieves state-of-the-art performance on Flickr30k and ReferIt under the usual "seen" settings and performs significantly better than the baseline in the zero-shot setting.

“…• In a fully-observable context (De Vries et al 2017), it is usually given that every information about the context is shared among the agents. This makes common grounding easier because information about the context is already in their common ground, and there could be little chance of misunderstandings.…”

Section: A's View B's View (mentioning)

confidence: 99%

“…In this approach, dialogue state and utterance generation are learned directly from large raw corpora with little prior constraints, so they are more suitable for complex common grounding where flexibility is a requirement. However, few existing tasks focus on the difficulty of common grounding and most are based on either fully-observable or categorical context (De Vries et al 2017; Bordes and Weston 2016; Lewis et al 2017) where difficult common grounding is not required. A dataset closest to our setting is the MutualFriends dataset (He et al 2017), which is based on the task of finding a mutual friend from private lists of friends.…”

Section: Related Work (mentioning)

confidence: 99%

A Natural Language Corpus of Common Grounding under Continuous and Partially-Observable Context

Udagawa

Aizawa

2019

AAAI


Common grounding is the process of creating, repairing and updating mutual understandings, which is a critical aspect of sophisticated human communication. However, traditional dialogue systems have limited capability of establishing common ground, and we also lack task formulations which introduce natural difficulty in terms of common grounding while enabling easy evaluation and analysis of complex models. In this paper, we propose a minimal dialogue task which requires advanced skills of common grounding under continuous and partially-observable context. Based on this task formulation, we collected a large-scale dataset of 6,760 dialogues which fulfills essential requirements of natural language corpora. Our analysis of the dataset revealed important phenomena related to common grounding that need to be considered. Finally, we evaluate and analyze baseline neural models on a simple subtask that requires recognition of the created common ground. We show that simple baseline models perform decently but leave room for further improvement. Overall, we show that our proposed task will be a fundamental testbed where we can train, evaluate, and analyze dialogue system's ability for sophisticated common grounding.


“…GuessWhat?! [5] is a collaborative 2-player visual grounded object discovery game. The game begins with presenting an image $I$ of a rich visual scene containing $M$ objects $C = \{c_m\}_{m=1}^{M}$ to both players, the questioner and the answerer.…”

Section: Related Work (mentioning)

confidence: 99%

“…Research on goal-oriented visual dialogue [1,5] has recently attracted lots of attention. Unlike the conventional VQA [8], where the robot answerer has to answer any question related to an input image raised by a human even if the question itself is ambiguous or indefinite, the goal-oriented visual dialogue extends the question-answering interactions to multiple rounds, turning the robot into also a questioner which can retrieve more information from the human.…”

Section: Introduction (mentioning)

confidence: 99%

Learning Goal-Oriented Visual Dialog Agents: Imitating and Surpassing Analytic Experts

Chang

Peng

2019

2019 IEEE International Conference on Multimedia and Expo (ICME)


This paper tackles the problem of learning a questioner in the goal-oriented visual dialog task. Several previous works adopt model-free reinforcement learning. Most pretrain the model from a finite set of human-generated data. We argue that using limited demonstrations to kick-start the questioner is insufficient due to the large policy search space. Inspired by a recently proposed information theoretic approach, we develop two analytic experts to serve as a source of high-quality demonstrations for imitation learning. We then take advantage of reinforcement learning to refine the model towards the goal-oriented objective. Experimental results on the GuessWhat?! dataset show that our method has the combined merits of imitation and reinforcement learning, achieving the state-of-the-art performance.
