“…Recent years have witnessed increasing attention to visually grounded dialogue (Zarrieß et al., 2016; de Vries et al., 2018; Alamri et al., 2019; Narayan-Chen et al., 2019). Despite the impressive progress on benchmark scores and model architectures (Das et al., 2017b; Wu et al., 2018; Kottur et al., 2018; Gan et al., 2019; Shukla et al., 2019; Niu et al., 2019; Zheng et al., 2019; Kang et al., 2019; Murahari et al., 2019; Pang and Wang, 2020), critical problems have also been pointed out in terms of dataset biases (Goyal et al., 2017; Chattopadhyay et al., 2017; Massiceti et al., 2018; Chen et al., 2018; Kottur et al., 2019; Kim et al., 2020; Agarwal et al., 2020), which obscure such contributions. For instance, Cirik et al. (2018) point out that an existing dataset for reference resolution may be largely solvable without recognizing the full referring expressions (e.g.…”