…Gan et al. (2017) and Zhao et al. (2020) have proposed style-guided captioning, but their methods also rely on training over paired data. CLIP (Radford et al., 2021) marked a turning point in vision-language perception and has been leveraged for vision-related tasks through various distillation techniques (Song et al., 2022; Jin et al., 2021; Gal et al., 2021; Khandelwal et al., 2022). Recent captioning methods use CLIP to reduce training time (Mokady et al., 2021), improve caption quality (Shen et al., 2021; Luo et al., 2022a,b; Cornia et al., 2021; Kuo and Kira, 2022), and enable zero-shot captioning (Su et al., 2022; Tewel et al., 2022).…