Detecting Unseen Visual Relations Using Analogies

Peyre, Julia; Šivic, Josef; Laptev, Ivan; Schmid, Cordelia

doi:10.1109/iccv.2019.00207

Cited by 122 publications

(150 citation statements)

References 31 publications

(60 reference statements)

Supporting

Mentioning: 149

Contrasting


“…model [33] have shown that using both unigram and trigram representations of HOIs may solve the above contradiction. Nonetheless, all these methods ignore the implicit relations among HOI categories, thus we extend the hybrid model by aggregating common sense knowledge for generating semantic embeddings.…”

Section: Related Work (mentioning)

confidence: 99%

ConsNet: Learning Consistency Graph for Zero-Shot Human-Object Interaction Detection

Liu, Yuan, Chen. 2020. Proceedings of the 28th ACM International Conference on Multimedia


We consider the problem of Human-Object Interaction (HOI) Detection, which aims to locate and recognize HOI instances in the form of ⟨human, action, object⟩ in images. Most existing works treat HOIs as individual interaction categories and thus cannot handle the long-tail distribution and polysemy of action labels. We argue that multi-level consistencies among objects, actions and interactions are strong cues for generating semantic representations of rare or previously unseen HOIs. Leveraging the compositional and relational peculiarities of HOI labels, we propose ConsNet, a knowledge-aware framework that explicitly encodes the relations among objects, actions and interactions into an undirected graph called the consistency graph, and exploits Graph Attention Networks (GATs) to propagate knowledge among HOI categories as well as their constituents. Our model takes visual features of candidate human-object pairs and word embeddings of HOI labels as inputs, maps them into a visual-semantic joint embedding space, and obtains detection results by measuring their similarities. We extensively evaluate our model on the challenging V-COCO and HICO-DET datasets, and the results validate that our approach outperforms state-of-the-art methods under both fully-supervised and zero-shot settings.

CCS Concepts: Computing methodologies → Activity recognition and understanding; Scene understanding.



“…Video Visual Relation Detection. Compare to ImgVRD [5,13,15,33-35], VidVRD has not received sufficient attention until the recent due to its complexity and a lack of suitable dataset. [19] contributed ImageNet-VidVRD dataset which labels all relation triplets in video as well as the trajectories of corresponding subject and object and becomes the first dataset on video visual relation detection.…”

Section: Related Work (mentioning)

confidence: 99%

“…Unlike visual relation detection in image (ImgVRD) that has been widely studied for years [5,13,15,33-35], its counterpart in video domain has just attracted researchers' attention [16,19,23]. Video visual relation detection (VidVRD) requires to track the objects and their pairwise relations in a video.…”

Section: Introduction (mentioning)

confidence: 99%

Video Relation Detection via Multiple Hypothesis Association

Shang, Chen, et al. 2020. Proceedings of the 28th ACM International Conference on Multimedia


Video visual relation detection (VidVRD) aims at obtaining not only the trajectories of objects but also the dynamic visual relations between them. It provides abundant information for video understanding and can serve as a bridge between vision and language. Compared with visual relation detection on image, VidVRD requires one more step at last called visual relation association which associates relation segments across time dimension into video relations. This step plays an important role in the task but is less studied. Nevertheless, visual relation association is a difficult task as the association process is easily affected by inaccurate tracklet detection and relation prediction in the former steps. In this paper, we propose a novel relation association method called Multiple Hypothesis Association (MHA). It maintains multiple possible relation hypothesis during the association process in order to tolerate and handle the inaccurate or missing problem in the former steps and generate more accurate video relations. Our experiments on the benchmark datasets (Imagenet-VidVRD and VidOR) show that our method outperforms the state-of-the-art methods.


“…A single predicate could introduce up to 20² new relationship categories, for which samples must be collected and models should be trained. Moreover, we know that the distribution of naturally-occurring triplets is long-tailed, with combinations such as person ride dog rarely appearing [29]. This exposes standard training methods to issues arising from extreme class imbalance.…”

Section: Introduction (mentioning)

confidence: 99%

Explanation-Based Weakly-Supervised Learning of Visual Relations with Graph Networks

Baldassarre, Smith, Sullivan, et al. 2020. Lecture Notes in Computer Science


Visual relationship detection is fundamental for holistic image understanding. However, the localization and classification of (subject, predicate, object) triplets remain challenging tasks, due to the combinatorial explosion of possible relationships, their long-tailed distribution in natural images, and an expensive annotation process. This paper introduces a novel weakly-supervised method for visual relationship detection that relies on minimal image-level predicate labels. A graph neural network is trained to classify predicates in images from a graph representation of detected objects, implicitly encoding an inductive bias for pairwise relations. We then frame relationship detection as the explanation of such a predicate classifier, i.e. we obtain a complete relation by recovering the subject and object of a predicted predicate. We present results comparable to recent fully- and weakly-supervised methods on three diverse and challenging datasets: HICO-DET for human-object interaction, Visual Relationship Detection for generic object-to-object relations, and UnRel for unusual triplets; demonstrating robustness to non-comprehensive annotations and good few-shot generalization.

