Audiovisual Attention Modeling and Salient Event Detection

Evangelopoulos, Georgios; Rapantzikos, Konstantinos; Maragos, Petros; Avrithis, Yannis; Potamianos, Alexandros

doi:10.1007/978-0-387-76316-3_8

Cited by 20 publications

(13 citation statements)

References 35 publications

“…where a N (t) is the amplitude,  N (t) is the frequency, and N is the size of the audio sequence. The audio attention model is based on three features: the maximum Teager energy, M_TE; the mean instant amplitude, M_IA; and the mean instant frequency, M_IF [47]. The first, M_TE, captures the joint amplitude-frequency information of the audio activity, which represents the dominant signal modulation energy.…”

Section: Audio Visual Attention Modeling (mentioning)

confidence: 99%

Divide-and-conquer based summarization framework for extracting affective video content

et al. 2016

Recent advances in multimedia technology have led to tremendous increases in the available volume of video data, thereby creating a major requirement for efficient systems to manage such huge data volumes. Video summarization is one of the key techniques for accessing and managing large video libraries. Video summarization can be used to extract the affective contents of a video sequence to generate a concise representation of its content. Human attention models are an efficient means of affective content extraction. Existing visual attention driven summarization frameworks have high computational cost and memory requirements, as well as a lack of efficiency in accurately perceiving human attention. To cope with these issues, we propose a divide-and-conquer based framework for an efficient summarization of big video data. We divide the original video data into shots, where an attention model is computed from each shot in parallel. Viewer attention is based on multiple sensory perceptions, i.e., aural and visual, as well as the viewer's neuronal signals. The aural attention model is based on the Teager energy, instant amplitude, and instant frequency, whereas the visual attention model employs multi-scale contrast and motion intensity. Moreover, the neuronal attention is computed using the beta-band frequencies of neuronal signals. Next, an aggregated attention curve is generated using an intra-and inter-modality fusion mechanism. Finally, the affective content in each video shot is extracted. The fusion of multimedia and neuronal signals provides a bridge that links the digital representation of multimedia with the viewer's perceptions. Our experimental results indicate that the proposed shot-detection based divide-and-conquer strategy mitigates the time and computational complexity. Moreover, the proposed attention model provides an accurate reflection of the user preferences and facilitates the extraction of highly affective and personalized summaries.
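
The aural features named in this abstract (Teager energy, instant amplitude, instant frequency) can be illustrated with a minimal Python sketch. It assumes the discrete Teager-Kaiser energy operator and the DESA-2 energy separation algorithm applied directly to a single audio frame, with no band-pass filterbank stage; whether the cited papers use exactly this variant is not stated here. The function names `teager_energy`, `desa2`, and `frame_features` are hypothetical.

```python
import numpy as np


def teager_energy(x):
    """Discrete Teager-Kaiser energy: psi[n] = x[n]^2 - x[n-1] * x[n+1]."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]


def desa2(x, eps=1e-12):
    """Estimate instantaneous amplitude and frequency with DESA-2.

    Returns (amp, omega, psi_x), all aligned to samples 2 .. N-3 of x.
    omega is in radians per sample. Samples where the energies are not
    strictly positive are left at zero (an assumption of this sketch).
    """
    x = np.asarray(x, dtype=float)
    if x.size < 5:
        raise ValueError("DESA-2 needs at least 5 samples")

    psi_x = teager_energy(x)[1:-1]   # energy of x, trimmed to samples 2 .. N-3
    y = x[2:] - x[:-2]               # symmetric difference y[n] = x[n+1] - x[n-1]
    psi_y = teager_energy(y)         # energy of y, already at samples 2 .. N-3

    valid = (psi_x > eps) & (psi_y > eps)
    amp = np.zeros_like(psi_x)
    omega = np.zeros_like(psi_x)
    arg = 1.0 - psi_y[valid] / (2.0 * psi_x[valid])
    omega[valid] = 0.5 * np.arccos(np.clip(arg, -1.0, 1.0))
    amp[valid] = 2.0 * psi_x[valid] / np.sqrt(psi_y[valid])
    return amp, omega, psi_x


def frame_features(frame):
    """Per-frame features: max Teager energy, mean amplitude, mean frequency."""
    amp, omega, psi = desa2(frame)
    return {
        "MTE": float(psi.max()),
        "MIA": float(amp.mean()),
        "MIF": float(omega.mean()),
    }
```

For a sampled sinusoid A·cos(Ωn), the operator gives A²·sin²(Ω) and DESA-2 recovers A and Ω, provided Ω is below π/2. To express `MIF` in Hz, multiply by the sampling rate and divide by 2π.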

“…The system is based on a modulation model for speech signals motivated by physical observations during speech production [18], the microproperties of speech signals, and a detection-theoretic optimality criterion. The features involved in the decision process have been previously used with success for speech endpoint detection in isolated word and sentences, VAD in large-scale databases and audio saliency modeling [19]. Moreover the developed VAD, based on divergence measures has been systematically compared in [17] with recent, high detection rate VAD [16], which in turn was evaluated against common standards.…”

Section: Audio Activity Detection (mentioning)

confidence: 99%

Audio-Assisted Movie Dialogue Detection

Kotti, Ververidis, Evangelopoulos, et al. 2008

IEEE Trans. Circuits Syst. Video Technol.

An audio-assisted system is investigated that detects whether a movie scene is a dialogue or not. The system is based on actor indicator functions, that is, functions which define whether an actor speaks at a certain time instant. In particular, the cross-correlation and the magnitude of the corresponding cross-power spectral density of a pair of indicator functions are input to various classifiers, such as voted perceptrons, radial basis function networks, random trees, and support vector machines for dialogue/non-dialogue detection. To boost classifier efficiency, AdaBoost is also exploited. The aforementioned classifiers are trained using ground truth indicator functions determined by human annotators for 41 dialogue and another 20 non-dialogue audio instances. For testing, actual indicator functions are derived by applying audio activity detection and actor clustering to audio recordings. 23 instances are randomly chosen among the aforementioned 41 dialogue instances, 17 of which correspond to dialogue scenes and 6 to non-dialogue ones. Accuracy ranging between 0.739 and 0.826 is reported.
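
The two quantities this abstract feeds to the classifiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the indicator functions are taken to be binary arrays sampled on a common time grid (1 while the actor speaks, 0 otherwise), and the function names `cross_correlation` and `cross_power_magnitude` are hypothetical. The classifier stage is not shown.

```python
import numpy as np


def cross_correlation(ind_a, ind_b):
    """Full cross-correlation of two indicator functions.

    For equal-length inputs of size N, output index N-1 is lag 0 and
    counts the samples where both actors speak at the same time.
    """
    a = np.asarray(ind_a, dtype=float)
    b = np.asarray(ind_b, dtype=float)
    return np.correlate(a, b, mode="full")


def cross_power_magnitude(ind_a, ind_b):
    """Magnitude of the cross-power spectrum of two indicator functions.

    Both inputs are zero-padded to the length of the full
    cross-correlation so the two representations are consistent.
    """
    a = np.asarray(ind_a, dtype=float)
    b = np.asarray(ind_b, dtype=float)
    n = a.size + b.size - 1
    fa = np.fft.rfft(a, n)
    fb = np.fft.rfft(b, n)
    return np.abs(fa * np.conj(fb))
```

In an alternating exchange the two indicators rarely overlap, so the zero-lag correlation is small while the correlation at nonzero lags is large, which is the kind of pattern a dialogue detector could exploit.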

“…In addition to classic ones such as indexing and summarization, applications focused more on higher level video understanding [1,2] have demonstrated significant promise. In the domain of movie content processing, various tasks such as narrative act structure characterization, violent scene detection and saliency prediction [3,4] for regions of potential greater engagement are some examples of interesting applications. Many of these methods help to analyze movie datasets at scale making it easier for human experts to perform higher level analytics and decision making.…”

Section: Introduction (mentioning)

confidence: 99%

Robust Multichannel Gender Classification from Speech in Movie Audio

et al. 2016

Speech in the form of scripted dialogues forms an important part of the audio signal in movies. However, it is often masked by background audio signals such as music, ambient noise or background chatter. These background sounds make even otherwise simple tasks, such as gender classification, challenging. Additionally, the variability in this noise across movies renders standard approaches to source separation or enhancement inadequate. Instead, we exploit multichannel information present in different language channels (English, Spanish, French) for each movie to improve the robustness of our gender classification system. We exploit the fact that the speaker labels of interest in this case co-occur in each language channel. We fuse the predictions obtained for each channel using Recognition Output Voting Error Reduction (ROVER) and show that this approach improves the gender accuracy by 7% absolute (11% relative) compared to the best independent prediction on any single channel. In the case of surround movies, we further investigate fusion of mono audio and front center channels which shows 5% and 3% absolute (8% and 4% relative) increase in accuracy compared to only using mono and front center channel, respectively.
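
The idea of fusing per-channel predictions can be illustrated with a plain majority vote over time-aligned segments. This is a simplified stand-in, not ROVER itself, which also handles alignment and can weight votes by confidence. The function name `fuse_by_vote`, the channel keys, and the label values are hypothetical.

```python
from collections import Counter


def fuse_by_vote(channel_predictions):
    """Majority vote across channels for each time-aligned segment.

    channel_predictions maps a channel name to a list of labels, one per
    segment; every channel must cover the same segments. Ties go to the
    label seen first in channel order.
    """
    label_lists = list(channel_predictions.values())
    if not label_lists:
        return []
    if len({len(labels) for labels in label_lists}) != 1:
        raise ValueError("all channels must have the same number of segments")

    fused = []
    for segment_labels in zip(*label_lists):
        winner, _count = Counter(segment_labels).most_common(1)[0]
        fused.append(winner)
    return fused
```

With three language channels and two gender labels a tie cannot occur, so the fused label is simply the one that at least two channels agree on.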
