2021
DOI: 10.3390/electronics10131601

ViolenceNet: Dense Multi-Head Self-Attention with Bidirectional Convolutional LSTM for Detecting Violence

Abstract: Introducing efficient automatic violence detection in video surveillance or audiovisual content monitoring systems would greatly facilitate the work of closed-circuit television (CCTV) operators, rating agencies or those in charge of monitoring social network content. In this paper we present a new deep learning architecture, using an adapted version of DenseNet for three dimensions, a multi-head self-attention layer and a bidirectional convolutional long short-term memory (LSTM) module, that allows encoding r…

Cited by 41 publications (25 citation statements)
References 53 publications
“…Table 5 shows that our proposed model is more lightweight than previously proposed methods for violence detection. Although the models presented by Sudhakaran and Lanz [22], Akti et al. [42], and Rendón-Segador et al. [40] are slightly more accurate, our proposed model has a much lower count of parameters compared to these models, which makes our method faster and computationally efficient. The only model that has a lower number of parameters than ours was the end-to-end CNN-LSTM model presented by AlDahoul et al. [43]; however, experiments showed that this model is less accurate and less precise (model precision is 72.53 ± 4.6%) than ours.…”
Section: Experiments and Results (mentioning)
confidence: 92%
“…For violence detection, they summarized the video sequences into dynamic images [53] and used these images to train a CNN classifier. Rendón-Segador et al. [8] adopted a 3D DenseNet and combined it with a self-attention mechanism, and a bidirectional convolutional LSTM, to detect violence. The method relies on the optical flow as input, which is first encoded by the DenseNet into sequences of feature maps, and then passed on to self-attention and ConvLSTM layers before carrying out prediction by the fully connected layers of the classifier.…”
Section: Deep Learning-based Methods (mentioning)
confidence: 99%
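
The self-attention stage mentioned in the statement above can be illustrated with a minimal sketch. This is not the authors' implementation: it is generic scaled dot-product multi-head self-attention applied to a short sequence of plain feature vectors, whereas the paper applies attention to feature maps produced by a 3D DenseNet. The function names, the random weights, and the sizes (6 time steps, 8 features, 2 heads) are all assumptions chosen for illustration.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head self-attention (illustrative, hypothetical helper).

    x is a (T, D) sequence of feature vectors; each weight matrix is (D, D).
    Returns the (T, D) output and one (T, T) attention map per head.
    """
    d_model = len(x[0])
    assert d_model % num_heads == 0, "feature size must divide evenly into heads"
    d_head = d_model // num_heads
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

    head_outputs, attn_maps = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        qh = [r[lo:hi] for r in q]
        kh = [r[lo:hi] for r in k]
        vh = [r[lo:hi] for r in v]
        k_t = [list(c) for c in zip(*kh)]           # transpose keys: (d_head, T)
        scores = matmul(qh, k_t)                    # (T, T) similarity scores
        attn = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        attn_maps.append(attn)
        head_outputs.append(matmul(attn, vh))       # weighted sum of values

    # Concatenate the heads along the feature axis, then mix them with w_o.
    merged = [sum((head[t] for head in head_outputs), []) for t in range(len(x))]
    return matmul(merged, w_o), attn_maps


def rand_matrix(rows, cols):
    """Small random matrix; stands in for learned weights (assumption)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
T, D, H = 6, 8, 2                                   # illustrative sizes only
x = rand_matrix(T, D)
w_q, w_k, w_v, w_o = (rand_matrix(D, D) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, H)
print(len(out), len(out[0]), len(attn))
```

Each row of an attention map sums to 1, so every output time step is a weighted average of the value vectors across the whole sequence. That is the property that lets an attention layer relate distant frames to one another.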
“…The need for improved techniques of autonomous detection is gaining more and more focus, mainly because of enormous amounts of surveillance data being generated and the impracticality of its manual monitoring because of the human toil involved. Several traditional (e.g., [3][4][5]) as well as deep learning-based methods (e.g., [6][7][8]) have focused on the problem. Abnormal events detection encompasses two types of video scenes: crowded and uncrowded [9].…”
Section: Introduction (mentioning)
confidence: 99%
“…Rendón-Segador et al. (2021) present a new approach for determining whether a video has a violent scene or not, based on an adapted 3D DenseNet, a multi-head self-attention layer, and a bidirectional ConvLSTM module that enables encoding relevant spatio-temporal features. In addition, an ablation analysis of the input frames is carried out, comparing dense optical flow and neighboring frames removal, as well as the effect of the attention layer, revealing that combining optical flow and the attention mechanism enhances findings by up to 4.4 percent.…”
Section: Classification Of Violence Detection Techniques (mentioning)
confidence: 99%