Spatial-Temporal Transformer for Dynamic Scene Graph Generation

Cong, Yuren; Liao, Wentong; Ackermann, Hanno; Rosenhahn, Bodo; Yang, Michael Ying

doi:10.1109/iccv48922.2021.01606

Cited by 107 publications

(117 citation statements)

References 61 publications

Supporting

Mentioning

117

Contrasting

“…Works using a hybrid ConvLSTM [149] are also found [62], [123]. Finally, in some instances, networks pre-trained to perform an auxiliary task (regarded as experts) are used to pre-process the input and provide specific information that can be leveraged by the Transformer [66], [131]. Some examples include object detection [80], action features [13], or scene, motion, OCR and facial features, among others [104].…”

Section: Embedding (mentioning)

confidence: 99%

“…Local restriction approaches reduce computational complexity from O(T²) to O(T · N), where N is the size of the neighborhood. One set of works [9], [99], [99], [119], [131], define the neighborhoods by sampling nearby tokens given a query, similar to the sliding window approach in the NLP Longformer [153]. Importantly, in [99], the [CLS] token does perform all-to-all attention.…”

Section: Restricted Approaches (mentioning)

confidence: 99%
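
The local restriction described in the statement above can be illustrated with a minimal sketch. It is not code from the survey or from any cited work; the function names and the `radius` parameter are hypothetical, and the neighborhood size is assumed to be N = 2 · radius + 1 tokens centered on the query. The sketch builds the mask densely for readability, so it only shows which query-key pairs are kept (at most T · N); an efficient implementation would avoid computing the masked scores at all.

```python
import numpy as np

def local_attention_mask(num_tokens: int, radius: int) -> np.ndarray:
    """Boolean mask where entry [q, k] is True if query q may attend to key k."""
    idx = np.arange(num_tokens)
    # A key is visible when it lies within `radius` positions of the query,
    # giving a neighborhood of at most N = 2 * radius + 1 tokens.
    return np.abs(idx[:, None] - idx[None, :]) <= radius

def local_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray, radius: int) -> np.ndarray:
    """Scaled dot-product attention restricted to a sliding window (dense, for illustration)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    mask = local_attention_mask(q.shape[0], radius)
    # Pairs outside the window get -inf so they receive zero weight after softmax.
    scores = np.where(mask, scores, -np.inf)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

if __name__ == "__main__":
    T, radius = 8, 1
    mask = local_attention_mask(T, radius)
    print("kept pairs:", int(mask.sum()), "of", T * T, "(bound T*N =", T * (2 * radius + 1), ")")
```

Setting `radius` to T - 1 or larger recovers ordinary all-to-all attention, which is one way to check the sketch against an unrestricted baseline.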

“…Venn diagram displaying our proposed taxonomy of efficient VT designs (best viewed in color). TS (TimeSformer) and TSx [9], STVG-BERT [117], SCT [90], AVT [80], ViViT [11], SAVM [60], LVT [61], HERO [14], VideoSwin [12], VATT [49], MViT [48], COOT [71], SMT [134], Perceiver [130], STTran [131], VATNet [10], MART [58], HISAN [81], Dyadformer [142], PE [64], VTN [99], VMTN [129], PCSA [119], MDAM [55], VATT [49], TrDIMP [82], PMPNet [42], and Transfuser [132]. We describe Local, Axial and Sparse approaches in Sec. 3.2.1, Hierarchical and Small-Q(ueries) in Sec.…”

mentioning

confidence: 99%

Video Transformers: A Survey

Selva, Johansen, Escalera, et al. 2022

Preprint

Transformer models have shown great success modeling long-range interactions. Nevertheless, they scale quadratically with input length and lack inductive biases. These limitations can be further exacerbated when dealing with the high dimensionality of video. Proper modeling of video, which can span from seconds to hours, requires handling long-range interactions. This makes Transformers a promising tool for solving video-related tasks, but some adaptations are required. While there are previous works that study the advances of Transformers for vision tasks, none focuses on in-depth analysis of video-specific designs. In this survey we analyse and summarize the main contributions and trends for adapting Transformers to model video data. Specifically, we delve into how videos are embedded and tokenized, finding a very widespread use of large CNN backbones to reduce dimensionality and a predominance of patches and frames as tokens. Furthermore, we study how the Transformer layer has been tweaked to handle longer sequences, generally by reducing the number of tokens in a single attention operation. Also, we analyse the self-supervised losses used to train Video Transformers, which to date are mostly constrained to contrastive approaches. Finally, we explore how other modalities are integrated with video and conduct a performance comparison on the most common benchmark for Video Transformers (i.e., action classification), finding them to outperform 3D CNN counterparts with equivalent FLOPs and no significant parameter increase.

“…The source code is made publicly available on Github. Now many models [32], [33], [34], [35], [36], [37] are available to generate scene graphs from different perspectives, and some works even extend the scene graph generation task from images to videos [38], [39], [40], [41]. Two-stage methods following [2] are currently dominating scene graph generation: several works [9], [32], [42], [43] use residual neural networks with the global context to improve the quality of the generated scene graphs.…”

Section: Scene Graph Generation (mentioning)

confidence: 99%

“…Its encoder-decoder configuration and attention mechanism are also used to solve various computer vision tasks in different ways, e.g. object detection [18], human-object interaction (HOI) detection [61], and dynamic scene graph generation [39].…”

Section: Transformer and Set Prediction (mentioning)

confidence: 99%

RelTR: Relation Transformer for Scene Graph Generation

Cong, Yang, Rosenhahn 2022

Preprint

Self Cite

Different objects in the same scene are more or less related to each other, but only a limited number of these relationships are noteworthy. Inspired by DETR, which excels in object detection, we view scene graph generation as a set prediction problem and propose an end-to-end scene graph generation model RelTR which has an encoder-decoder architecture. The encoder reasons about the visual feature context while the decoder infers a fixed-size set of triplets subject-predicate-object using different types of attention mechanisms with coupled subject and object queries. We design a set prediction loss performing the matching between the ground truth and predicted triplets for the end-to-end training. In contrast to most existing scene graph generation methods, RelTR is a one-stage method that predicts a set of relationships directly only using visual appearance without combining entities and labeling all possible predicates. Extensive experiments on the Visual Genome and Open Images V6 datasets demonstrate the superior performance and fast inference of our model.
