2023
DOI: 10.1109/tmm.2022.3169065

Scene Graph Refinement Network for Visual Question Answering

Abstract: Existing Visual Question Answering (VQA) models have explored various visual relationships between objects in the image to answer complex questions, which inevitably introduces irrelevant information brought by inaccurate object detection and text grounding. To address the problem, we propose a Question-Driven Graph Fusion Network (QD-GFN). It first models semantic, spatial, and implicit visual relations in images by three graph attention networks, then question information is utilized to guide the aggregation…
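The question-guided aggregation described in the abstract can be illustrated with a minimal numpy sketch. The function names, the dot-product relevance scoring, and the pooling choices below are illustrative assumptions for exposition, not the paper's actual design:

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def question_guided_fusion(question, graph_feats):
    """Fuse object features from several relation graphs (e.g., semantic,
    spatial, implicit), weighting each graph by its relevance to the question.

    question:    (d,) pooled question embedding
    graph_feats: list of (n_objects, d) object-feature matrices, one per graph
    returns:     (n_objects, d) fused object features
    """
    # Relevance score for each graph: dot product between the question
    # vector and that graph's mean-pooled object features (an assumption;
    # the paper may use a learned attention function instead).
    scores = np.array([question @ g.mean(axis=0) for g in graph_feats])
    weights = softmax(scores)  # one scalar weight per relation graph
    # Convex combination of the per-graph object features.
    return sum(w * g for w, g in zip(weights, graph_feats))
```

This sketch only conveys the overall idea of suppressing relation graphs that are irrelevant to the question; the actual model applies graph attention within each graph before fusion.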

Cited by 30 publications (10 citation statements)
References 65 publications
“…Video Encoder. We uniformly sample T frames from V and extract their CNN (e.g., I3D (Qian et al. 2023)) features. These features are contextually encoded using a video encoder ϕ_v to yield frame features ϕ_v(V) = {v_1, v_2, …”
Section: Pseudo-supervised Setup
confidence: 99%
“…Natural Language Video Localization (NLVL) is a fundamental multimodal understanding task that aims to align textual queries with relevant video segments. NLVL is a core component for various applications such as video moment retrieval (Cao et al. 2022), video question answering (Qian et al. 2023; Lei et al. 2020a), and video editing (Gao et al. 2022). Prior works have primarily explored supervised (Zeng et al. 2020; Wang, Ma, and Jiang 2020; Soldan et al. 2021; Liu et al. 2021; Yu et al. 2020) or weakly supervised (Mun, Cho, and Han 2020; Zhang et al. 2020, 2021) NLVL methodologies, relying on annotated video-query data to various extents.…”
Section: Introduction
confidence: 99%
“…Liu et al. [15] used the RGB-D information of images to represent inaccurate depth features for extracting semantic information. Qian et al. [6] propose to refine the scene graphs for improving the effectiveness and present a scene graph refinement network (SGR), which introduces a transformer-based refinement network to enhance the object and relation features for better classification. Wu et al. [21] propose to enhance video captioning with deep-level object relationships that are adaptively explored during training and present a transitive visual relationship detection (TVRD) module.…”
Section: Related Work
confidence: 99%
“…Visual relationships are usually expressed as triples <subject–predicate–object> [3,4,5]. They play an essential role in higher-level vision tasks, such as visual question answering [6], image captioning [7], and image generation [8]. There are many promising results in visual relationship detection works.…”
Section: Introduction
confidence: 99%
“…The ultimate goal of scene graph generation is to produce such representations from raw images (or videos [14]). Such representations have proved valuable for a variety of higher-level AI tasks, e.g., image/video captioning [24,31], and visual question answering [34]. Scene graph generation models usually contain two message passing networks: one for object detection, and the other for relation prediction among the detected objects.…”
Section: Introduction
confidence: 99%