2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2019
DOI: 10.1109/cvpr.2019.01017
Dance With Flow: Two-In-One Stream Action Detection

Abstract: The goal of this paper is to detect the spatio-temporal extent of an action. The two-stream detection network based on RGB and flow provides state-of-the-art accuracy at the expense of a large model-size and heavy computation. We propose to embed RGB and optical-flow into a single two-in-one stream network with new layers. A motion condition layer extracts motion information from flow images, which is leveraged by the motion modulation layer to generate transformation parameters for modulating the low-level RGB…
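The abstract describes a conditioning-and-modulation scheme: a motion condition layer summarizes the flow image, and a motion modulation layer turns that summary into parameters that transform low-level RGB features. The sketch below illustrates the general idea as a FiLM-style feature-wise modulation in PyTorch; the layer shapes, pooling choice, and placement in the backbone are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class MotionCondition(nn.Module):
    # Hypothetical motion condition layer: summarizes a stacked-flow input
    # into one motion code per clip.
    def __init__(self, flow_channels=2, hidden=64):
        super().__init__()
        self.encode = nn.Sequential(
            nn.Conv2d(flow_channels, hidden, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),
        )

    def forward(self, flow):
        return self.encode(flow).flatten(1)          # (N, hidden)

class MotionModulation(nn.Module):
    # Hypothetical motion modulation layer: maps the motion code to per-channel
    # scale/shift parameters applied to low-level RGB features (FiLM-style).
    def __init__(self, hidden=64, rgb_channels=64):
        super().__init__()
        self.to_scale = nn.Linear(hidden, rgb_channels)
        self.to_shift = nn.Linear(hidden, rgb_channels)

    def forward(self, rgb_feat, motion_code):
        scale = self.to_scale(motion_code)[:, :, None, None]
        shift = self.to_shift(motion_code)[:, :, None, None]
        return scale * rgb_feat + shift

# Usage: modulate early RGB features of a single detection backbone with flow.
rgb_feat = torch.randn(4, 64, 56, 56)   # low-level RGB feature maps (shape assumed)
flow = torch.randn(4, 2, 224, 224)      # stacked optical flow for the same frames
modulated = MotionModulation()(rgb_feat, MotionCondition()(flow))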

Cited by 89 publications (45 citation statements). References 45 publications.
“…The resolution of individual frames remains the same (240 × 320), without cropping or resizing, and multiple modalities are exploited via two-stream frames: RGB and stacked optical flow. Proposals of bounding boxes from each stream are merged with non-maximum suppression (NMS) (refer to [24] for more detail).…”
Section: B. Setups and Hyper-parameters
confidence: 99%
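The citation statement above describes pooling bounding-box proposals from the RGB and flow streams and merging them with non-maximum suppression. Below is a minimal sketch of that merge step using torchvision's NMS; the concatenation strategy and the IoU threshold are assumptions here, so refer to [24] for the exact procedure.

import torch
from torchvision.ops import nms

def merge_two_stream_proposals(boxes_rgb, scores_rgb, boxes_flow, scores_flow,
                               iou_threshold=0.5):
    # Pool proposals from both streams, then suppress overlapping duplicates.
    # boxes_*:  (N, 4) tensors in (x1, y1, x2, y2) format
    # scores_*: (N,) per-box confidence scores from each stream
    boxes = torch.cat([boxes_rgb, boxes_flow], dim=0)
    scores = torch.cat([scores_rgb, scores_flow], dim=0)
    keep = nms(boxes, scores, iou_threshold)   # indices of surviving boxes
    return boxes[keep], scores[keep]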
“…Such motion-based attention can be especially useful for the intended setting utilizing stationary cameras. Building on this technique of motion-based attention, Zhao and Snoek developed an algorithm for detecting the spatiotemporal extent of actions by embedding the RGB spatial and optical flow temporal streams into a single two-in-one stream network [101]. Aside from simplifying the computation of action recognition, their approach also assigns motion direction to the actor as an extra feature distinctive to many actions (e.g., the difference between sitting down and standing up, or PA-entailing motion towards a direction relevant to the research questions studied with proposed smart sensors).…”
Section: The Promise of Computer Vision
confidence: 99%
“…Subsequent architectures have focused on modelling longer temporal structure, through consensus of predictions over time [30,58,63] as well as inflating CNNs to 3D convolutions [6], all using the two-stream approach of late-fusing RGB and Flow. The latest architectures have focused on reducing the high computational cost of 3D convolutions [12,21,61], yet still show improvements when reporting results of two-stream fusion [61]. Self-supervision for Action Recognition.…”
Section: Related Work
confidence: 99%
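The related-work passage above refers to the two-stream approach of late-fusing RGB and flow predictions. Below is a minimal sketch of score-level late fusion over a batch of clips; the equal weighting and softmax normalization are assumptions, and individual two-stream papers differ in their fusion schemes.

import torch

def late_fuse(rgb_logits, flow_logits, rgb_weight=0.5):
    # Average class probabilities from the RGB and flow streams.
    rgb_scores = torch.softmax(rgb_logits, dim=-1)
    flow_scores = torch.softmax(flow_logits, dim=-1)
    return rgb_weight * rgb_scores + (1.0 - rgb_weight) * flow_scores

# Example: per-clip logits from two backbones over 10 action classes.
fused = late_fuse(torch.randn(8, 10), torch.randn(8, 10))
predictions = fused.argmax(dim=-1)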