2010
DOI: 10.1007/978-3-642-12842-4_13
Audio-Visual Fusion for Detecting Violent Scenes in Videos

Cited by 78 publications (45 citation statements)
References 10 publications
Citing publications span 2014–2024.
“…More recently, Yan et al. [17] have developed a multi-task learning approach for head-pose estimation in a multi-camera environment under target motion. Giannakopoulos et al. [5], in an attempt to extend their earlier approach based solely on audio cues [7], propose a multimodal two-stage approach based on k-nearest neighbors (k-NN). In the first stage, they perform audio and visual analysis of one-second segments.…”
Section: Related Work
confidence: 99%
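The two-stage scheme summarized above lends itself to a compact illustration. Below is a minimal sketch with one k-NN per modality whose per-segment probabilities feed a second k-NN; the feature inputs and the stacking rule in stage two are assumptions for illustration, not the exact design of Giannakopoulos et al. [5].

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def two_stage_knn(audio_feats, visual_feats, labels, k=5):
    """Stage 1: one k-NN per modality over 1-second segment features.
    Stage 2: a k-NN meta-classifier over the stage-1 probabilities.
    Assumes labels are 0/1 with 1 = violent (illustrative convention)."""
    audio_knn = KNeighborsClassifier(n_neighbors=k).fit(audio_feats, labels)
    visual_knn = KNeighborsClassifier(n_neighbors=k).fit(visual_feats, labels)
    # Per-segment probability of the violent class from each modality; in
    # practice these should come from held-out folds to avoid optimism.
    p_audio = audio_knn.predict_proba(audio_feats)[:, 1]
    p_visual = visual_knn.predict_proba(visual_feats)[:, 1]
    meta = np.column_stack([p_audio, p_visual])
    meta_knn = KNeighborsClassifier(n_neighbors=k).fit(meta, labels)
    return audio_knn, visual_knn, meta_knn
```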
“…Recently, solutions using mid-level feature representations have gained popularity. These solutions shifted away not only from the traditional approaches that represent videos using low-level features (e.g., [4,5]) but also from state-of-the-art detectors designed to identify high-level semantic concepts (e.g., "a killing spree"). The former cannot carry enough semantic information, and the latter have not reached a sufficient level of maturity.…”
Section: Introduction
confidence: 99%
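As a concrete, purely illustrative example of what a mid-level representation can look like, the sketch below builds a bag-of-words over low-level descriptors with a k-means codebook; the cited works may use different mid-level features, so this conveys only the general idea.

```python
import numpy as np
from sklearn.cluster import KMeans

def bag_of_words(descriptors_per_video, n_words=100):
    """Quantize low-level descriptors against a learned codebook and
    represent each video as a normalized visual/audio-word histogram."""
    all_desc = np.vstack(descriptors_per_video)
    codebook = KMeans(n_clusters=n_words, n_init=10).fit(all_desc)
    histograms = []
    for desc in descriptors_per_video:
        words = codebook.predict(desc)
        hist = np.bincount(words, minlength=n_words).astype(float)
        histograms.append(hist / max(hist.sum(), 1.0))
    return codebook, np.array(histograms)
```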
“…This makes it difficult for researchers to work on common ground [1]. Interpretations of violence include violent actions by humans where blood is visible [2], scenes containing gunshots, fights, and explosions [3], person-to-person harmful acts such as threats and physical harm [4], and fighting scenes regardless of the number of individuals involved or the context [5,6]. These differing interpretations lead to different techniques for violent scene detection (VSD), which makes comparative studies difficult.…”
Section: Introduction
confidence: 99%
“…Many researchers, on the other hand, have been interested in combining the auditory and visual modalities. The combined use of audio features (e.g., chroma, spectrogram, and Mel-Frequency Cepstral Coefficients (MFCC)) and visual features (e.g., motion-based variance, motion of people, and average motion) produced good results [4]. In [9], the authors performed modified probabilistic Latent Semantic Analysis (pLSA)-based violence detection from audio cues and visual information by exploiting different concepts (including explosion, motion, blood, and flame).…”
Section: Introduction
confidence: 99%
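To make the feature-level combination concrete, here is a minimal sketch of early audio-visual fusion: clip-level MFCC/chroma statistics plus a crude frame-difference motion descriptor. The file paths, parameter values, and the frame-difference stand-in for "motion-based variance" are assumptions, not the exact features of [4] or [9].

```python
import numpy as np
import librosa
import cv2

def audio_features(wav_path):
    """Clip-level audio vector: mean and std of MFCC and chroma frames."""
    y, sr = librosa.load(wav_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)
    frames = np.vstack([mfcc, chroma])
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)])

def motion_features(video_path):
    """Clip-level motion vector: mean and variance of frame-to-frame
    absolute pixel differences (a crude proxy for motion-based variance)."""
    cap = cv2.VideoCapture(video_path)
    diffs, prev = [], None
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY).astype(float)
        if prev is not None:
            diffs.append(np.abs(gray - prev).mean())
        prev = gray
    cap.release()
    d = np.asarray(diffs) if diffs else np.zeros(1)
    return np.array([d.mean(), d.var()])

# Early fusion: concatenate the two modality vectors per clip and feed the
# result to any off-the-shelf classifier (k-NN, SVM, ...). Paths are placeholders.
# x = np.concatenate([audio_features("clip.wav"), motion_features("clip.mp4")])
```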
“…Three of the most commonly used audio descriptors are MFCC (Mel-Frequency Cepstral Coefficients), LPCC (Linear Prediction Cepstral Coefficients), and Chroma (Giannakopoulos et al., 2010).…”
Section: Scene Segmentation with Audio Descriptors
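For reference, a minimal sketch of extracting these descriptors with librosa follows; "audio.wav" and the parameter values are placeholders, and since librosa exposes only raw LPC coefficients, the LPC-to-LPCC cepstral conversion is noted but omitted.

```python
import librosa

# Load a mono clip at a fixed sample rate ("audio.wav" is a placeholder path).
y, sr = librosa.load("audio.wav", sr=22050)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
chroma = librosa.feature.chroma_stft(y=y, sr=sr)     # shape (12, n_frames)
lpc = librosa.lpc(y, order=12)                       # raw LPC coefficients;
                                                     # LPCC would follow via
                                                     # the usual cepstral recursion
```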