Unbiased Scene Graph Generation From Biased Training

Tang, Kaihua; Niu, Yulei; Huang, Jianqiang; Shi, Jiaxin; Zhang, Hanwang

doi:10.1109/cvpr42600.2020.00377

Cited by 570 publications

(563 citation statements)

References 44 publications

Mentioning: 492

“…We provide additional experimental results on Visual Genome (VG) dataset [52]. We follow [19], [21], [29] to adopt the most widely-used dataset split which consists of 108K images and includes the most frequent 150 object classes and 50 predicates. When evaluating visual relationship detection/scene graph generation on VG, there are three common evaluation modes including (1) Predicate Classification (PredCls): ground truth bounding boxes and object labels are given, (2) Scene Graph Classification (SGCls): only ground truth boxes given, and (3) Scene Graph Detection (SGDet): nothing other than input images is given.…”

Section: A Results On Visual Genome (mentioning)

confidence: 99%

“…Both SMN [21] and KERN [25] exploit this property and use the frequency bias and object cooccurrence, respectively. However, the usage of bias could reversely undermine the capability of generalization which has been demonstrated by comparing mean recall in recent works (e.g., [29]).…”

Section: A Results On Visual Genome (mentioning)

confidence: 99%

“…It is a very crucial task for enabling an intelligent system to understand the content of images, and has received much attention over the past few years [1]- [18]. Based on VRD, Xu et al [19] proposed scene graph generation (SGG) [20]- [29], which targets at extracting a comprehensive and symbolic graph representation in an image, with vertices and edges denoting instances and for visual relationships respectively. We focus on and use the term VRD throughout this paper for consistency.…”

Section: Introduction (mentioning)

confidence: 99%


Visual Relationship Detection With Visual-Linguistic Knowledge From Multimodal Representations

2021


Visual relationship detection aims to reason over relationships among salient objects in images, which has drawn increasing attention over the past few years. Inspired by human reasoning mechanisms, it is believed that external visual commonsense knowledge is beneficial for reasoning visual relationships of objects in images, which is however rarely considered in existing methods. In this paper, we propose a novel approach named Relational Visual-Linguistic Bidirectional Encoder Representations from Transformers (RVL-BERT), which performs relational reasoning with both visual and language commonsense knowledge learned via self-supervised pre-training with multimodal representations. RVL-BERT also uses an effective spatial module and a novel mask attention module to explicitly capture spatial information among the objects. Moreover, our model decouples object detection from visual relationship recognition by taking in object names directly, enabling it to be used on top of any object detection system. We show through quantitative and qualitative experiments that, with the transferred knowledge and novel modules, RVL-BERT achieves competitive results on two challenging visual relationship detection datasets.


“…Given an image-text pair (I, w), we first extract the visual scene graph G from the image with an off-the-shelf scene graph generator (Tang et al, 2020). A scene graph is a directed graph with the nodes representing the objects and the edges depicting their pairwise relationships.…”

Section: Cross-modal Alignment With Visual Scene Graph Encoding (mentioning)

confidence: 99%

“…In implementation, we first embed tokens in both the text sequence w and scene graph triplets (extracted by SGG (Tang et al, 2020)) with a pretrained BERT embedder (Devlin et al, 2019). We then extract the visual embedding of each image region and also the union region of each triplet with the Faster R-CNN component (Ren et al, 2015) used in the bottom-up-attention (Anderson et al, 2018).…”

Section: Cross-modal Alignment With Visual Scene Graph Encoding (mentioning)

confidence: 99%

Semantic Aligned Multi-modal Transformer for Vision-Language Understanding: A Preliminary Study on Visual QA

Ding, Li, Hu et al. 2021

Proceedings of the Third Workshop on Multimodal Artificial Intelligence


Recent vision-language understanding approaches adopt a multi-modal transformer pretraining and finetuning paradigm. Prior work learns representations of text tokens and visual features with cross-attention mechanisms and captures the alignment solely based on indirect signals. In this work, we propose to enhance the alignment mechanism by incorporating image scene graph structures as the bridge between the two modalities, and learning with new contrastive objectives. In our preliminary study on the challenging compositional visual question answering task, we show the proposed approach achieves improved results, demonstrating potentials to enhance vision-language understanding.


Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

Wang, Shan et al. 2020

Lecture Notes in Computer Science
