BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Lê, Hung; Chen, Nancy F.; Hoi, Steven C. H.

doi:10.18653/v1/2020.emnlp-main.145

Cited by 25 publications

(16 citation statements)

References 45 publications

“…Numerous approaches to video-grounded dialogue have shown remarkable performance in building intelligent multimodal systems (Schwartz et al, 2019; Le et al, 2019; Le et al, 2020). However, most of these methods exhibit marginal performance gain, and our ability to understand their limitations is impeded by the complexity of the task.…”

Section: Introduction (mentioning)

confidence: 99%

DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue

Lê

Sankar

Moon

et al. 2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite

A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from 11k CATER synthetic videos and contains 10 instances of 10-round dialogues for each video, resulting in more than 100k dialogues and 1M question-answer pairs. Our code and dataset are publicly available.

“…However, application in dynamic domains would involve additional complexities that need to be taken into account, such as the dependencies on previous common ground. Finally, we are planning to study wider variety of model architectures and pretraining datasets, including video-processing methods (Carreira and Zisserman, 2017; Wang et al, 2018), vision-language grounding models (Lu et al, 2019; Le et al, 2020), and large-scale, open domain datasets (Krishna et al, 2017b; Sharma et al, 2018). Note that the entity-level representation of the observation (required in our baseline) can be obtained from raw video features, for example, by utilizing the object trackers (Bergmann et al, 2019;.…”

Section: Discussion (mentioning)

confidence: 99%

Maintaining Common Ground in Dynamic Environments

Udagawa

Aizawa

2021

Transactions of the Association for Computational Linguistics

Common grounding is the process of creating and maintaining mutual understandings, which is a critical aspect of sophisticated human communication. While various task settings have been proposed in existing literature, they mostly focus on creating common ground under a static context and ignore the aspect of maintaining it over time under a dynamic context. In this work, we propose a novel task setting to study the ability of both creating and maintaining common ground in dynamic environments. Based on our minimal task formulation, we collected a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. Through our dataset analyses, we highlight novel challenges introduced in our setting, such as the usage of complex spatio-temporal expressions to create and maintain common ground. Finally, we conduct extensive experiments to assess the capabilities of our baseline dialogue system and discuss future prospects of our research.

“…However, application in dynamic domains would involve additional complexities that need to be taken into account, such as the dependencies on previous common ground. Finally, we're planning to study wider variety of model architectures and pretraining datasets, including video-processing methods (Carreira and Zisserman, 2017; Wang et al, 2018), vision-language grounding models (Lu et al, 2019; Le et al, 2020) and large-scale, open domain datasets (Krishna et al, 2017b; Sharma et al, 2018). Note that the entity-level representation of the observation (required in our baseline) can be obtained from raw video features, e.g.…”

Section: Discussion (mentioning)

confidence: 99%

Maintaining Common Ground in Dynamic Environments

Udagawa

Aizawa

2021

Preprint

Common grounding is the process of creating and maintaining mutual understandings, which is a critical aspect of sophisticated human communication. While various task settings have been proposed in existing literature, they mostly focus on creating common ground under a static context and ignore the aspect of maintaining it over time under a dynamic context. In this work, we propose a novel task setting to study the ability of both creating and maintaining common ground in dynamic environments. Based on our minimal task formulation, we collected a large-scale dataset of 5,617 dialogues to enable fine-grained evaluation and analysis of various dialogue systems. Through our dataset analyses, we highlight novel challenges introduced in our setting, such as the usage of complex spatio-temporal expressions to create and maintain common ground. Finally, we conduct extensive experiments to assess the capabilities of our baseline dialogue system and discuss future prospects of our research.
