2019
DOI: 10.1109/tpami.2018.2828437

Visual Dialog

Abstract: We introduce the task of Visual Dialog, which requires an AI agent to hold a meaningful dialog with humans in natural, conversational language about visual content. Specifically, given an image, a dialog history, and a question about the image, the agent has to ground the question in image, infer context from history, and answer the question accurately. Visual Dialog is disentangled enough from a specific downstream task so as to serve as a general test of machine intelligence, while being sufficiently grounde…

Citations: cited by 31 publications (9 citation statements)
References: 44 publications
“…In [1], the concept of AI-based visual chatbot was introduced. The bot taken into consideration is a mixture of typical chatbot and visual content, i.e., pictures.…”
Section: Figure 1 Banking T'aio
Citation type: mentioning
confidence: 99%
“…Differently, [6] utilizes user model and communication resources in developing a deep reinforcement learning network for a large financial corporation to enhance its customers experience. The main disadvantages of using AI-based systems [1,7,9] are long training and validation processes and requiring high performance computing cluster. In addition, for languages like Vietnamese, there are many ways to construct a question and a question can be understood based on conversational context, thus, identifying if a sentence is a question is always a challenging task.…”
Section: Figure 1 Banking T'aio
Citation type: mentioning
confidence: 99%
“…If the context sentence number N_c is less than 99, we will randomly sample another 99 − N_c sentences from the whole rest target corpus. As for the data format, we follow most of Das et al (2019).…”
Section: Datasets and Setups
Citation type: mentioning
confidence: 99%
“…Driven by the rapid growth of computer vision and natural language processing technologies, in recent years there has been a growing interest in multimodal intelligent tasks that require the ability to concurrently understand various forms of input data such as images and text. Some of the typical multimodal intelligent tasks are vision question answering (VQA) [1] that generate answers to natural language questions on the image presented, visual dialog [2] that holds a meaningful question and answer (Q&A) dialog on the input image, and image/video captioning that generates texts describing the contents of the input image or video. More advanced multimodal intelligent tasks have also been presented, including embodied question answering (EQA) [3], assuming an embodied agent moving around in a virtual environment [3], interactive question answering (IQA) [4], cooperative vision and dialog navigation (CVDN) [5], remote embodied visual referring expression in real indoor environments (REVERIE) [6], and vision and language navigation (VLN) [7].…”
Section: Introduction
Citation type: mentioning
confidence: 99%