2019
DOI: 10.48550/arxiv.1909.06273
Preprint

Scene Graph Parsing by Attention Graph

Martin Andrews,
Yew Ken Chia,
Sam Witteveen

Abstract: Scene graph representations, which form a graph of visual object nodes together with their attributes and relations, have proved useful across a variety of vision and language applications. Recent work in the area has used Natural Language Processing dependency tree methods to automatically build scene graphs. In this work, we present an 'Attention Graph' mechanism that can be trained end-to-end, and produces a scene graph structure that can be lifted directly from the top layer of a standard Transformer model.…
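
As an illustration of the representation described in the abstract's first sentence (object nodes, their attributes, and relations between them), here is a minimal Python sketch. It is not the paper's Attention Graph mechanism and is not taken from the authors' code; the class name `SceneGraph`, its methods, and the example phrase are all hypothetical, chosen only to make the data structure concrete.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Hypothetical container: objects are nodes, attributes attach to
    nodes, and relations are labeled directed edges between nodes."""

    objects: list[str] = field(default_factory=list)
    attributes: dict[int, list[str]] = field(default_factory=dict)
    relations: list[tuple[int, str, int]] = field(default_factory=list)

    def add_object(self, name: str, *attrs: str) -> int:
        """Add an object node with optional attributes; return its index."""
        self.objects.append(name)
        idx = len(self.objects) - 1
        self.attributes[idx] = list(attrs)
        return idx

    def add_relation(self, subj: int, predicate: str, obj: int) -> None:
        """Add a directed, labeled edge from node `subj` to node `obj`."""
        self.relations.append((subj, predicate, obj))

    def triples(self) -> list[tuple[str, str, str]]:
        """Return relations as (subject name, predicate, object name)."""
        return [(self.objects[s], p, self.objects[o]) for s, p, o in self.relations]


# Made-up example phrase: "a young man riding a brown horse"
graph = SceneGraph()
man = graph.add_object("man", "young")
horse = graph.add_object("horse", "brown")
graph.add_relation(man, "riding", horse)
```

Storing relations as index triples rather than name triples keeps two objects with the same name (for example, two horses) distinct, which is one common design choice for this kind of structure.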

Cited by 3 publications
(3 citation statements)
References 19 publications
“…Grounding a scene graph with an image or image description can be beneficial for a variety of downstream tasks, such as image retrieval (Andrews et al, 2019; Johnson et al, 2015), image caption evaluation (Anderson et al, 2016) and image captioning (Zhong et al, 2020). Currently, there are three main research directions to scene graph parsing: those that focus on parsing images (Zellers et al, 2018; Tang et al, 2020; Xu et al, 2017; Zhang et al, 2019a; Cong et al, 2022; Li et al, 2022), text (Anderson et al, 2016; Schuster et al, 2015; Wang et al, 2018; Choi et al, 2022; Andrews et al, 2019; Sharifzadeh et al, 2022), or both modalities (Zhong et al, 2021; Sharifzadeh et al, 2022) into scene graphs. Parsing images involves utilizing an object detection model to identify the location and class of objects, as well as classifiers to determine the relationships and attributes of the objects.…”
Section: Related Work | mentioning
confidence: 99%
“…Contextual information was also used in [125] where a Relation Proposal Network (RePN) is proposed to deal with the dimensionality problem of object relations, drastically reducing the number of relations that actually need to be accounted for. As with many other areas, Transformers have also been successfully used in generating scene graphs with competent results [126], [127]. A fully convolutional SGG method was proposed in [128], showing that using a pre-trained detector is not necessary and good results can also be obtained, even high zero-shot recall.…”
Section: B Scene-graph-based Knowledge Representation | mentioning
confidence: 99%
“…Early approaches to this problem use a dependency parser as a basis for the SG prediction (Schuster et al 2015; Wang et al 2018). Recently, Andrews, Chia, and Witteveen (2019) proposed to train a transformer model on a parallel dataset of image region descriptions and scene graphs, taken from the Visual Genome dataset (Krishna et al 2017).…”
Section: Related Work | mentioning
confidence: 99%