2021
DOI: 10.1109/access.2021.3083273
|View full text |Cite
|
Sign up to set email alerts
|

Efficient Spatio-Temporal Modeling Methods for Real-Time Violence Recognition

Abstract: Violence recognition is challenging since recognition must be performed on videos acquired by a lot of surveillance cameras at any time or place. It should make reliable detections in real time and inform surveillance personnel promptly when violent crimes take place. Therefore, we focus on efficient violence recognition for real-time and on-device operation, for easy expansion into a surveillance system with numerous cameras. In this paper, we propose a novel violence detection pipeline that can be combined w… Show more

Help me understand this report

Search citation statements

Order By: Relevance

Paper Sections

Select...
1
1
1

Citation Types

0
16
0

Year Published

2022
2022
2024
2024

Publication Types

Select...
5
1

Relationship

1
5

Authors

Journals

citations
Cited by 54 publications
(16 citation statements)
references
References 52 publications
0
16
0
Order By: Relevance
“…It requires more parameters and FLOPS compared to the 2D convolution since the expanded kernels stride in time as well as space. Frame-grouping is proposed in our previous work [30] to give 2D CNNs the ability to learn spatio-temporal representations in videos. We make each three-channel frame into a single-channel frame z t ∈ R H×W by averaging across the channel axis and group three consecutive frames to learn spatio-temporal representations in a video with the conventional 2D-CNN backbones as follows:…”
Section: Action Classmentioning
confidence: 99%
See 4 more Smart Citations
“…It requires more parameters and FLOPS compared to the 2D convolution since the expanded kernels stride in time as well as space. Frame-grouping is proposed in our previous work [30] to give 2D CNNs the ability to learn spatio-temporal representations in videos. We make each three-channel frame into a single-channel frame z t ∈ R H×W by averaging across the channel axis and group three consecutive frames to learn spatio-temporal representations in a video with the conventional 2D-CNN backbones as follows:…”
Section: Action Classmentioning
confidence: 99%
“…We apply frame-grouping in the first convolutional layer of the networks to get a three-frame temporal receptive field. As discussed in [30], frame-grouping is effective to model shortterm dynamics. It can efficiently deal with some actions that can be captured in a short duration.…”
Section: Temporal Receptive Field Of the Proposed Modelmentioning
confidence: 99%
See 3 more Smart Citations