2021
DOI: 10.48550/arxiv.2103.04680
Preprint

Time and Frequency Network for Human Action Detection in Videos

Abstract: Currently, spatiotemporal features are embraced by most deep learning approaches for human action detection in videos; however, they neglect important features in the frequency domain. In this work, we propose an end-to-end network, named TFNet, that considers time and frequency features simultaneously. TFNet holds two branches: one is the time branch, formed of a three-dimensional convolutional neural network (3D-CNN), which takes the image sequence as input to extract time features; the other is the frequency branch…
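To make the two-branch design concrete, here is a minimal PyTorch sketch of a time/frequency network in the spirit of the abstract. It is not the authors' implementation: the frequency transform (per-frame rFFT magnitude), the channel widths, late fusion by concatenation, and the 24-class output are all illustrative assumptions.

```python
# Minimal sketch of a two-branch time/frequency network in the spirit of
# TFNet (arXiv:2103.04680). NOT the authors' code: the frequency transform,
# channel sizes, and fusion-by-concatenation are illustrative assumptions.
import torch
import torch.nn as nn


class TimeBranch(nn.Module):
    """3D-CNN over the raw clip (B, C, T, H, W) -> temporal features."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> (B, feat, 1, 1, 1)
        )

    def forward(self, clip):
        return self.net(clip).flatten(1)  # (B, feat)


class FrequencyBranch(nn.Module):
    """Same clip, but seen through a spectral view: the magnitude of a
    per-frame 2D rFFT stands in (as an assumption) for the paper's
    frequency-domain input."""
    def __init__(self, in_ch=3, feat=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, feat, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),
        )

    def forward(self, clip):
        freq = torch.fft.rfft2(clip, dim=(-2, -1)).abs()  # (B, C, T, H, W//2+1)
        return self.net(freq).flatten(1)


class TFNetSketch(nn.Module):
    def __init__(self, num_classes=24, feat=64):
        super().__init__()
        self.time_branch = TimeBranch(feat=feat)
        self.freq_branch = FrequencyBranch(feat=feat)
        self.classifier = nn.Linear(2 * feat, num_classes)  # assumed late fusion

    def forward(self, clip):
        fused = torch.cat([self.time_branch(clip), self.freq_branch(clip)], dim=1)
        return self.classifier(fused)


if __name__ == "__main__":
    x = torch.randn(2, 3, 8, 112, 112)  # batch of 2 eight-frame RGB clips
    print(TFNetSketch()(x).shape)       # torch.Size([2, 24])
```

The point of the sketch is the structure rather than the specific layers: the same clip feeds both branches, with the frequency branch seeing a spectral representation of the data before its 3D convolutions.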

Cited by 1 publication (3 citation statements)
References 22 publications
“…The frame-level detectors [56,69] generate final action tubes by linking frame-level detection results. To take full advantage of temporal information, some clip-level detectors have been proposed [24,40,44,45,51,55,60,66,74,79]. ACT [40] handled a short sequence of frames and output action tubes by regressing from anchor cuboids.…”
Section: Related Work
confidence: 99%
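As an illustration of the frame-level linking step mentioned in the statement above, the sketch below greedily joins per-frame detections into action tubes by IoU overlap. The scoring rule (IoU plus confidence), the threshold, and all helper names are assumptions, not any cited paper's exact procedure.

```python
# Illustrative greedy tube linking of frame-level detections (one class at
# a time). Not a reproduction of any cited paper's algorithm.
from dataclasses import dataclass


@dataclass
class Detection:
    box: tuple    # (x1, y1, x2, y2)
    score: float


def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def link_tubes(frames, iou_thresh=0.3):
    """frames: list over time of lists of Detection.
    Each tube greedily claims the next-frame detection maximizing
    IoU + score; unmatched detections start new tubes."""
    tubes = []  # each tube is a list of (frame_index, Detection)
    for t, dets in enumerate(frames):
        unmatched = list(dets)
        for tube in tubes:
            last_t, last_det = tube[-1]
            if last_t != t - 1 or not unmatched:
                continue  # tube already terminated, or nothing left to claim
            best = max(unmatched, key=lambda d: iou(last_det.box, d.box) + d.score)
            if iou(last_det.box, best.box) >= iou_thresh:
                tube.append((t, best))
                unmatched.remove(best)
        tubes.extend([[(t, d)] for d in unmatched])  # start new tubes
    return tubes
```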
“…Sarmiento et al. [60] introduced two cross-attention blocks to effectively model spatial relations and capture short-range temporal interactions. In recent years, some approaches have attempted to recognize actions based on 3D convolution features [32,44,51,55,79]. ACDnet [51] intelligently exploited the temporal coherence between successive video frames to approximate their CNN features rather than naively extracting them.…”
Section: Related Work
confidence: 99%
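ACDnet's feature-approximation idea can be pictured as warping expensive keyframe features to nearby frames instead of recomputing them. The sketch below shows a generic flow-guided feature warp built on torch's grid_sample; the flow source, resolution, and sign convention are assumptions rather than ACDnet's exact scheme.

```python
# Generic flow-guided warping of keyframe CNN features to a nearby frame.
# An assumption-laden illustration of feature approximation, not ACDnet.
import torch
import torch.nn.functional as F


def warp_features(feat, flow):
    """feat: (B, C, H, W) features computed once on the keyframe.
    flow: (B, 2, H, W) displacement in feature-map pixels, mapping each
    target location back to its source in the keyframe (assumed convention)."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, dtype=feat.dtype, device=feat.device),
        torch.arange(w, dtype=feat.dtype, device=feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys), dim=0)   # (2, H, W), x before y
    coords = base.unsqueeze(0) + flow     # per-pixel sampling locations
    # grid_sample expects coordinates normalized to [-1, 1]
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)  # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)


# Usage: compute backbone features on a keyframe once, then approximate a
# nearby frame's features from a (hypothetical) cheap flow estimate.
feat_key = torch.randn(1, 64, 28, 28)
flow = torch.zeros(1, 2, 28, 28)          # zero flow -> identity warp
approx = warp_features(feat_key, flow)
assert torch.allclose(approx, feat_key, atol=1e-5)
```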