ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413686

Violence Detection in Videos Based on Fusing Visual and Audio Information

Abstract: Determining whether given video frames contain violent content is a basic problem in violence detection. Visual and audio information are useful for detecting violence included in a video, and are usually complementary; however, violence detection studies focusing on fusing visual and audio information are relatively rare. Therefore, we explored methods for fusing visual and audio information. We proposed a neural network containing three modules for fusing multimodal information: 1) attention module for utili…

Cited by 42 publications (21 citation statements)
References 25 publications
“…Finally, a mutual learning module is added to make the model learn visual information from another neural network with a different structure. Unlike Pang's method [26], we found that the violence detection accuracy of different kinds of videos is related to the change of optical flow. Therefore, our network added the optical flow feature to extract the motion features of objects and effectively solve the problems of short duration and weak action for the task of violence detection.…”
Section: Introduction (contrasting)
confidence: 62%
“…By using a sequence composed of multiple segments as an optimization unit, they reduced the probability of selection errors during training. Pang et al. [26] further improved the algorithm on the basis of Wu et al. [22], who focused on fusing audio and visual information. First, weighted features are used to generate effective features under the guidance of audio and visual information.…”
Section: Introduction (mentioning)
confidence: 99%
“…Problem with alignment of text and video: As we highlight in the second subsection of our related works, BLP has yielded great performance in video tasks where it fuses the visual features with non-textual features. Audio and visual feature fusion demonstrates impressive performance on action recognition (Hu et al., 2021), emotion recognition (Zhou et al., 2021), and violence detection (Pang et al., 2021). Likewise, different visual representations have thrived in RGBT tracking (Xu et al., 2021), action recognition (Deng et al., 2021) and video-QA on MSVD-QA (Wang, Bao & Xu, 2021).…”
Section: Discussion (mentioning)
confidence: 99%
“…Hu et al. (2021) use compact BLP to fuse audio and ‘visual long range’ features for human action recognition. Pang et al. (2021) use MLB as part of an attention-based fusion for audio and visual features for violence detection in videos. Xu et al. (2021) use BLP to fuse visual features from different channels in colour image (RGB) and thermal infrared tracking (TiR) i.e.…”
Section: Related Work (mentioning)
confidence: 99%
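
For readers unfamiliar with the MLB (multimodal low-rank bilinear pooling) fusion mentioned in the statement above, the following is a minimal sketch of its general form: each modality is projected into a shared low-rank space, the projections are multiplied element-wise, and the result is projected to the output size. It is not the architecture of the cited paper. The function names, feature dimensions, and random weights are all hypothetical and chosen only for illustration.

```python
import math
import random


def matvec_t(W, x):
    """Compute W^T x, where W is a list of len(x) rows with d columns each."""
    d = len(W[0])
    return [sum(W[i][j] * x[i] for i in range(len(x))) for j in range(d)]


def mlb_fuse(visual, audio, U, V, P, b):
    """Low-rank bilinear fusion: P^T (tanh(U^T visual) * tanh(V^T audio)) + b."""
    hv = [math.tanh(t) for t in matvec_t(U, visual)]  # visual projection
    ha = [math.tanh(t) for t in matvec_t(V, audio)]   # audio projection
    joint = [p * q for p, q in zip(hv, ha)]           # element-wise product
    out = matvec_t(P, joint)                          # output projection
    return [o + bi for o, bi in zip(out, b)]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Hypothetical random weights; a trained model would learn these."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
visual = [rng.gauss(0.0, 1.0) for _ in range(8)]  # assumed 8-dim visual feature
audio = [rng.gauss(0.0, 1.0) for _ in range(5)]   # assumed 5-dim audio feature
U = rand_matrix(8, 4, rng)  # visual -> rank-4 joint space
V = rand_matrix(5, 4, rng)  # audio  -> rank-4 joint space
P = rand_matrix(4, 3, rng)  # joint  -> 3-dim fused feature
b = [0.0, 0.0, 0.0]

fused = mlb_fuse(visual, audio, U, V, P, b)
print(len(fused))
```

Because the two projections are multiplied element-wise, an all-zero input in either modality drives the fused feature to the bias alone. This is one reason such fusion is often paired with attention or residual paths in practice.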