2021
DOI: 10.48550/arxiv.2109.08475
Preprint
GoG: Relation-aware Graph-over-Graph Network for Visual Dialog

Abstract: Visual dialog, which aims to hold a meaningful conversation with humans about a given image, is a challenging task that requires models to reason the complex dependencies among visual content, dialog history, and current questions. Graph neural networks are recently applied to model the implicit relations between objects in an image or dialog. However, they neglect the importance of 1) coreference relations among dialog history and dependency relations between words for the question representation; and 2) the …

Cited by 1 publication (2 citation statements)
References 38 publications
“…Furthermore, the alignment-based approach model (Chen et al 2022) has shown promise in explicitly aligning visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment. Another intriguing approach (Chen et al 2021;Guo et al 2020;Zhang et al 2022b;Zheng et al 2019) is the graph-based representation suitable for the composite scenario of dialog history and image, which offers a structured way to understand relationships within an image. Diverging from these methodologies, our model leverages a large multi-modal hierarchical context.…”
Section: Visual Dialog
confidence: 99%
“…VisDial Similar to the previous work (Kang et al 2023), we compare the performance of our method with 10 baselines: 1) Attention-based models: CoAtt (Wu et al 2018), HCIAE (Lu et al 2017), Primary (Guo, Xu, and Tao 2019), ReDAN (Gan et al 2019), DMRM (Chen et al 2020a), DAM (Jiang et al 2020b) 2) Graph-based models: KBGN (Jiang et al 2020a), LTMI (Nguyen, Suganuma, and Okatani 2020), LTMI-GoG (Chen et al 2021) 3) Semi-supervised learning model: GST (Kang et al 2023).…”
Section: Baselines
confidence: 99%