2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR 2018)
DOI: 10.1109/cvpr.2018.00710

Attend and Interact: Higher-Order Object Interactions for Video Understanding

Abstract: Human actions often involve complex interactions across several inter-related objects in the scene. However, existing approaches to fine-grained video understanding or visual relationship detection often rely on single object representation or pairwise object relationships. Furthermore, learning interactions across multiple objects in hundreds of frames for video is computationally infeasible and performance may suffer since a large combinatorial space has to be modeled. In this paper, we propose to efficiently […]
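The truncated abstract points at the paper's central idea: attend over groups of detected objects rather than enumerating every object pair. As a rough illustration only (this is a minimal sketch, not the authors' implementation; the module name, dimensions, and group count below are invented), an attention pool over object features in PyTorch might look like this:

```python
# Minimal sketch (assumed names/shapes, not the paper's implementation):
# attend over detected-object features to form a higher-order interaction
# vector, instead of enumerating all O(N^2) object pairs.
import torch
import torch.nn as nn

class ObjectInteraction(nn.Module):
    def __init__(self, obj_dim: int, hidden_dim: int, num_groups: int = 3):
        super().__init__()
        # One attention head per group; each head softly selects a subset
        # of objects whose pooled features act as one "interaction".
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(obj_dim, hidden_dim), nn.Tanh(),
                          nn.Linear(hidden_dim, 1))
            for _ in range(num_groups)
        ])
        self.fuse = nn.Linear(num_groups * obj_dim, obj_dim)

    def forward(self, objs: torch.Tensor) -> torch.Tensor:
        # objs: (batch, num_objects, obj_dim) features from a detector
        pooled = []
        for head in self.heads:
            weights = torch.softmax(head(objs), dim=1)   # (B, N, 1)
            pooled.append((weights * objs).sum(dim=1))   # (B, obj_dim)
        return self.fuse(torch.cat(pooled, dim=-1))      # (B, obj_dim)

# Usage: 10 detected objects with 512-d features.
feats = torch.randn(2, 10, 512)
print(ObjectInteraction(512, 256)(feats).shape)  # torch.Size([2, 512])
```

Because each head pools with a softmax over objects, the cost grows linearly in the number of detections, which is the kind of saving the abstract alludes to when it contrasts against modeling the full combinatorial space of object groupings.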

Cited by 142 publications (97 citation statements). References 58 publications.
“…Another branch of work utilizes optical flow to compensate for the lack of temporal information in raw RGB frames [42,9,49,3,29]. Moreover, some works extract temporal dependencies between frames for video tasks by utilizing recurrent neural networks (RNNs) [6], attention [28,30] and relation modules [57]. Note that we focus on attending to the temporal dynamics to effectively align domains and we consider other modalities, e.g.…”
Section: Related Work (mentioning, confidence: 99%)
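Several of these statements cite the paper for its attention over temporal dynamics. As a hypothetical sketch of that generic idea (names and sizes are illustrative, not taken from any of the cited works), temporal attention pools per-frame features into a single clip vector via learned frame weights:

```python
# Hypothetical sketch of temporal attention: pool per-frame features
# into one clip vector using learned soft weights over frames.
import torch
import torch.nn as nn

class TemporalAttention(nn.Module):
    def __init__(self, feat_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                                   nn.Linear(hidden_dim, 1))

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, feat_dim) per-frame CNN features
        weights = torch.softmax(self.score(frames), dim=1)  # (B, T, 1)
        return (weights * frames).sum(dim=1)                # (B, feat_dim)

clip = torch.randn(4, 16, 2048)             # e.g., 16 sampled frames
print(TemporalAttention(2048)(clip).shape)  # torch.Size([4, 2048])
```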
“…Visual attention usually comes in the form of temporal attention [35] (or spatial attention [33] in the image domain), semantic attention [14,36,37,42] or both [20]. The recent unprecedented success in object detection [24,7] has regained the community's interest in detecting fine-grained visual clues while incorporating them into end-to-end networks [17,27,1,16]. Description methods which are based on object detectors [17,39,1,16,5,13] tackle the captioning problem in two stages.…”
Section: Related Work (mentioning, confidence: 99%)
“…The recent unprecedented success in object detection [24,7] has regained the community's interest in detecting fine-grained visual clues while incorporating them into end-to-end networks [17,27,1,16]. Description methods which are based on object detectors [17,39,1,16,5,13] tackle the captioning problem in two stages. They first use off-the-shelf or fine-tuned object detectors to propose object proposals/detections, doing the visual recognition heavy lifting.…”
Section: Related Work (mentioning, confidence: 99%)
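The two-stage pipeline described above (detect first, caption second) can be sketched as follows; this is a hypothetical skeleton, with the Captioner module and all shapes invented for illustration rather than taken from the cited captioning methods:

```python
# Hypothetical two-stage skeleton (Captioner and all shapes are invented):
# stage 1 runs a frozen, off-the-shelf detector; stage 2 conditions a
# caption head on the detected-object features.
import torch
import torch.nn as nn

class Captioner(nn.Module):
    def __init__(self, obj_dim: int, vocab_size: int, hidden: int = 512):
        super().__init__()
        self.rnn = nn.GRU(obj_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, obj_feats: torch.Tensor) -> torch.Tensor:
        # obj_feats: (batch, num_objects, obj_dim) from the detector stage
        _, h = self.rnn(obj_feats)   # summarize the detections
        return self.out(h[-1])       # logits for the next word (simplified)

detections = torch.randn(2, 10, 1024)                  # stage 1 output (assumed)
logits = Captioner(1024, vocab_size=9000)(detections)  # stage 2
print(logits.shape)                                    # torch.Size([2, 9000])
```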
“…Visual relationship modeling for human-human and human-object pairs increases performance in a variety of tasks including action recognition [56] and image captioning [35,38]. There have been several works [5,13,16] on human-object interaction modeling in images that achieved significant improvements on HICO-DET [6] and V-COCO [32] datasets.…”
Section: Related Work (mentioning, confidence: 99%)