Scene Graph Parsing by Attention Graph

Andrews, Martin; Chia, Yew Ken; Witteveen, Sam

doi:10.48550/arxiv.1909.06273

Cited by 3 publications

(3 citation statements)

References 19 publications

“…Grounding a scene graph with an image or image description can be beneficial for a variety of downstream tasks, such as image retrieval (Andrews et al, 2019; Johnson et al, 2015), image caption evaluation (Anderson et al, 2016) and image captioning (Zhong et al, 2020). Currently, there are three main research directions to scene graph parsing: those that focus on parsing images (Zellers et al, 2018; Tang et al, 2020; Xu et al, 2017; Zhang et al, 2019a; Cong et al, 2022; Li et al, 2022), text (Anderson et al, 2016; Schuster et al, 2015; Wang et al, 2018; Choi et al, 2022; Andrews et al, 2019; Sharifzadeh et al, 2022), or both modalities (Zhong et al, 2021; Sharifzadeh et al, 2022) into scene graphs. Parsing images involves utilizing an object detection model to identify the location and class of objects, as well as classifiers to determine the relationships and attributes of the objects.…”

Section: Related Work (mentioning)

confidence: 99%

FACTUAL: A Benchmark for Faithful and Consistent Textual Scene Graph Parsing

Li, Chai, Yue, et al. 2023

Findings of the Association for Computational Linguistics: ACL 2023

Textual scene graph parsing has become increasingly important in various vision-language applications, including image caption evaluation and image retrieval. However, existing scene graph parsers that convert image captions into scene graphs often suffer from two types of errors. First, the generated scene graphs fail to capture the true semantics of the captions or the corresponding images, resulting in a lack of faithfulness. Second, the generated scene graphs have high inconsistency, with the same semantics represented by different annotations. To address these challenges, we propose a novel dataset, which involves re-annotating the captions in Visual Genome (VG) using a new intermediate representation called FACTUAL-MR. FACTUAL-MR can be directly converted into faithful and consistent scene graph annotations. Our experimental results clearly demonstrate that the parser trained on our dataset outperforms existing approaches in terms of faithfulness and consistency. This improvement leads to a significant performance boost in both image caption evaluation and zero-shot image retrieval tasks. Furthermore, we introduce a novel metric for measuring scene graph similarity, which, when combined with the improved scene graph parser, achieves state-of-the-art (SOTA) results on multiple benchmark datasets for the aforementioned tasks. The code and dataset are available at https://github.com/zhuang-li/FACTUAL.

“…Contextual information was also used in [125] where a Relation Proposal Network (RePN) is proposed to deal with the dimensionality problem of object relations, drastically reducing the number of relations that actually need to be accounted for. As with many other areas, Transformers have also been successfully used in generating scene graphs with competent results [126], [127]. A fully convolutional SGG method was proposed in [128], showing that using a pre-trained detector is not necessary and good results can also be obtained, even high zero-shot recall.…”

Section: B Scene-graph-based Knowledge Representation (mentioning)

confidence: 99%

From a Visual Scene to a Virtual Representation: A Cross-Domain Review

et al. 2023

The widespread use of smartphones and other low-cost equipment as recording devices, the massive growth in bandwidth, and the ever-growing demand for new applications with enhanced capabilities, made visual data a must in several scenarios, including surveillance, sports, retail, entertainment, and intelligent vehicles. Despite significant advances in analyzing and extracting data from images and video, there is a lack of solutions able to analyze and semantically describe the information in the visual scene so that it can be efficiently used and repurposed. Scientific contributions have focused on individual aspects or addressing specific problems and application areas, and no cross-domain solution is available to implement a complete system that enables information passing between cross-cutting algorithms. This paper analyses the problem from an end-to-end perspective, i.e., from the visual scene analysis to the representation of information in a virtual environment, including how the extracted data can be described and stored. A simple processing pipeline is introduced to set up a structure for discussing challenges and opportunities in different steps of the entire process, allowing to identify current gaps in the literature. The work reviews various technologies specifically from the perspective of their applicability to an end-to-end pipeline for scene analysis and synthesis, along with an extensive analysis of datasets for relevant tasks.

“…Early approaches to this problem use a dependency parser as a basis for the SG prediction (Schuster et al 2015; Wang et al 2018). Recently, Andrews, Chia, and Witteveen (2019) proposed to train a transformer model on a parallel dataset of image region descriptions and scene graphs, taken from the Visual Genome dataset (Krishna et al 2017).…”

Section: Related Work (mentioning)

confidence: 99%

Learning Object Detection from Captions via Textual Scene Attributes

Jerbi, Herzig, Berant, et al. 2020

Preprint

Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.
