Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
DOI: 10.18653/v1/2020.emnlp-main.145
BiST: Bi-directional Spatio-Temporal Reasoning for Video-Grounded Dialogues

Abstract: Video-grounded dialogues are very challenging due to (i) the complexity of videos, which contain both spatial and temporal variations, and (ii) the complexity of user utterances, which query different segments and/or different objects in videos over multiple dialogue turns. However, existing approaches to video-grounded dialogues often focus on superficial temporal-level visual cues and neglect more fine-grained spatial signals from videos. To address this drawback, we propose Bi-directional Spatio-Temporal Lea…

Cited by 25 publications (16 citation statements)
References 45 publications
“…Numerous approaches to video-grounded dialogue have shown remarkable performance in building intelligent multimodal systems (Schwartz et al., 2019; Le et al., 2019; Le et al., 2020). However, most of these methods exhibit marginal performance gains, and our ability to understand their limitations is impeded by the complexity of the task.…”
Section: Introduction
confidence: 99%
“…However, application in dynamic domains would involve additional complexities that need to be taken into account, such as the dependencies on previous common ground. Finally, we are planning to study a wider variety of model architectures and pretraining datasets, including video-processing methods (Carreira and Zisserman, 2017; Wang et al., 2018), vision-language grounding models (Lu et al., 2019; Le et al., 2020), and large-scale, open-domain datasets (Krishna et al., 2017b; Sharma et al., 2018). Note that the entity-level representation of the observation (required in our baseline) can be obtained from raw video features, for example, by utilizing object trackers (Bergmann et al., 2019).…”
Section: Discussion
confidence: 99%