NODIS: Neural Ordinary Differential Scene Understanding

Yang, Cong; Ackermann, Hanno; Liao, Wuping; Yang, Michael Ying; Rosenhahn, Bodo

doi:10.1007/978-3-030-58565-5_38

Cited by 10 publications

(5 citation statements)

References 46 publications


“…The performance of EDET in predicate classification is shown in Table 4 and Figure 9. [22] 58.5 65.2 67.1 NODIS [23] 58.9 66.0 67.9 VC-Tree [24] 59.8 66.2 67.9 GPS-Net [25] 60 The results show that EDET can generate excellent scene parsing in the scene graph predicate classification task. R@K means the recall rate of the top K prediction results.…”

Section: Predicate Classification (mentioning)

confidence: 98%

EDET: Entity Descriptor Encoder of Transformer for Multi-Modal Knowledge Graph in Scene Parsing

Wan et al. 2023

Applied Sciences


In scene parsing, the model is required to be able to process complex multi-modal data such as images and contexts in real scenes, and discover their implicit connections from objects existing in the scene. As a storage method that contains entity information and the relationship between entities, a knowledge graph can well express objects and the semantic relationship between objects in the scene. In this paper, a new multi-phase process was proposed to solve scene parsing tasks; first, a knowledge graph was used to align the multi-modal information and then the graph-based model generates results. We also designed an experiment of feature engineering’s validation for a deep-learning model to preliminarily verify the effectiveness of this method. Hence, we proposed a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which uses both the entity itself and its internal attributes for knowledge representation. This method can be embedded into the transformer structure to solve multi-modal scene parsing tasks. EDET can aggregate the multi-modal attributes of entities, and the results in the scene graph generation and image captioning tasks prove that EDET has excellent performance in multi-modal fields. Finally, the proposed method was applied to the industrial scene, which confirmed the viability of our method.
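The citation statement above defines R@K as the recall rate of the top K prediction results. A minimal per-image sketch of that metric follows, assuming predictions are (subject, predicate, object) triplets already sorted by confidence. The function name `recall_at_k` and the example triplets are hypothetical and are not taken from EDET or NODIS.

```python
from typing import List, Set, Tuple

# A relationship triplet: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def recall_at_k(ranked_predictions: List[Triplet],
                ground_truth: Set[Triplet],
                k: int) -> float:
    """Fraction of ground-truth triplets found among the top-k predictions.

    Assumes `ranked_predictions` is sorted from most to least confident.
    """
    if not ground_truth:
        return 0.0  # convention chosen here for images without annotations
    top_k = set(ranked_predictions[:k])
    return len(top_k & ground_truth) / len(ground_truth)


# Hypothetical example data for a single image.
predictions = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "on", "grass"),
    ("hat", "on", "horse"),
]
annotations = {("man", "riding", "horse"), ("horse", "on", "grass")}

print(recall_at_k(predictions, annotations, k=1))
print(recall_at_k(predictions, annotations, k=3))
```

Benchmark figures such as R@20, R@50 and R@100 are typically averaged over all test images. This sketch computes the value for one image only.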


“…The applications include image retrieval [46], image captioning [1,45], VQA [51,25] and image generation [24,19]. In order to generate high-quality scene graphs from images, a series of works explore different directions such as utilizing spatial context [61,65,40], graph structure [60,58,34], optimization [8], reinforcement learning [36,51], semi-supervised training [7] or a contrastive loss [66]. These works have achieved excellent results on image datasets [29,42,31].…”

Section: Related Work (mentioning)

confidence: 99%

Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Yang, Liao, Ackermann, et al. 2021

2021 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite



Dynamic scene graph generation aims at generating a scene graph of the given video. Compared to the task of scene graph generation from images, it is more challenging because of the dynamic relationships between objects and the temporal dependencies between frames allowing for a richer semantic interpretation. In this paper, we propose Spatial-temporal Transformer (STTran), a neural network that consists of two core modules: (1) a spatial encoder that takes an input frame to extract spatial context and reason about the visual relationships within a frame, and (2) a temporal decoder which takes the output of the spatial encoder as input in order to capture the temporal dependencies between frames and infer the dynamic relationships. Furthermore, STTran is flexible to take varying lengths of videos as input without clipping, which is especially important for long videos. Our method is validated on the benchmark dataset Action Genome (AG). The experimental results demonstrate the superior performance of our method in terms of dynamic scene graphs. Moreover, a set of ablative studies is conducted and the effect of each proposed module is justified. Code available at: https://github.com/yrcong/STTran.


“…The source code is made publicly available on Github. Now many models [32], [33], [34], [35], [36], [37] are available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41]. Two-stage methods following [2] are currently dominating scene graph generation: several works [9], [32], [42], [43] use residual neural networks with the global context to improve the quality of the generated scene graphs.…”

Section: Scene Graph Generation (mentioning)

confidence: 99%

“…Now many models [32], [33], [34], [35], [36], [37] are available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41]. Two-stage methods following [2] are currently dominating scene graph generation: several works [9], [32], [42], [43] use residual neural networks with the global context to improve the quality of the generated scene graphs. Xu et al [42] use standard RNNs to iteratively improves the relationship prediction via message passing while MotifNet [9] stacks LSTMs to reason about the local and global context.…”

Section: Scene Graph Generation (mentioning)

confidence: 99%

RelTR: Relation Transformer for Scene Graph Generation

Yang, Yang, Rosenhahn 2022

Preprint

Self Cite


Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels in object detection, we view scene graph generation as a set prediction problem and propose an end-to-end scene graph generation model RelTR which has an encoder-decoder architecture. The encoder reasons about the visual feature context while the decoder infers a fixed-size set of triplets subject-predicate-object using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss performing the matching between the ground truth and predicted triplets for the end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts a set of relationships directly only using visual appearance without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
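The abstract above mentions a set prediction loss that matches ground-truth and predicted triplets. The sketch below illustrates only the matching idea, under loud assumptions: the cost is a hypothetical count of differing triplet slots, and the search is brute force over assignments. It is not RelTR's actual cost function or matching procedure, and the names `mismatch_cost` and `match_triplets` are invented for illustration.

```python
from itertools import permutations
from typing import List, Sequence, Tuple

# A relationship triplet: (subject, predicate, object).
Triplet = Tuple[str, str, str]


def mismatch_cost(pred: Triplet, gt: Triplet) -> int:
    """Hypothetical cost: number of slots (subject, predicate, object) that differ."""
    return sum(p != g for p, g in zip(pred, gt))


def match_triplets(preds: Sequence[Triplet],
                   gts: Sequence[Triplet]) -> Tuple[List[Tuple[int, int]], int]:
    """Pair every ground-truth index with a distinct prediction index.

    Returns (pairs, total_cost), where pairs holds (gt_index, pred_index)
    tuples and total_cost is minimal. Brute force, so only for tiny sets.
    """
    if len(gts) > len(preds):
        raise ValueError("need at least as many predictions as ground-truth triplets")
    best_pairs: List[Tuple[int, int]] = []
    best_cost = None
    # Try every way of assigning distinct predictions to the ground truths.
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(mismatch_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if best_cost is None or cost < best_cost:
            best_pairs = [(g, p) for g, p in enumerate(perm)]
            best_cost = cost
    return best_pairs, best_cost


# Hypothetical example: three predicted triplets, two annotated ones.
predicted = [("dog", "on", "grass"), ("man", "riding", "horse"), ("man", "wearing", "hat")]
annotated = [("man", "riding", "horse"), ("dog", "on", "grass")]
pairs, total = match_triplets(predicted, annotated)
print(pairs, total)
```

Once each ground-truth triplet has a matched prediction, a training loss can be summed over the matched pairs. For realistic set sizes a polynomial-time assignment solver would replace the brute-force loop.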

