2012
DOI: 10.1007/978-1-4614-3831-1_1

On the Use of Audio Events for Improving Video Scene Segmentation

Abstract: This work deals with the problem of automatic temporal segmentation of a video into elementary semantic units known as scenes. Its novelty lies in the use of high-level audio information in the form of audio events for the improvement of scene segmentation performance. More specifically, the proposed technique is built upon a recently proposed audio-visual scene segmentation approach that involves the construction of multiple scene transition graphs (STGs) that separately exploit information coming from differ…
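The abstract builds on scene transition graphs (STGs). As a rough illustration only, the sketch below assumes every shot has already been assigned a cluster label (the clustering step and the function name `stg_scene_boundaries` are hypothetical, not taken from the paper) and applies a simplified reading of the STG cut-edge idea: a scene boundary is placed after shot t whenever no cluster has shots on both sides of t. How the multiple STGs built from different modalities are combined is not described in the excerpt, so that step is not shown.

```python
from typing import List


def stg_scene_boundaries(shot_clusters: List[int]) -> List[int]:
    """Return every t such that a scene boundary lies between shot t and shot t + 1.

    shot_clusters[i] is the cluster label of shot i (clustering assumed done
    elsewhere). A boundary is placed after shot t when no cluster has shots
    both at or before t and after t.
    """
    # Index of the last shot belonging to each cluster.
    last_seen = {}
    for i, label in enumerate(shot_clusters):
        last_seen[label] = i

    boundaries = []
    reach = -1  # furthest shot index still tied to the clusters seen so far
    for t, label in enumerate(shot_clusters[:-1]):
        reach = max(reach, last_seen[label])
        if reach == t:  # no cluster seen so far recurs after shot t
            boundaries.append(t)
    return boundaries


if __name__ == "__main__":
    # Shots 0-2 share cluster 0, shots 3-5 share cluster 2, shot 6 stands alone.
    print(stg_scene_boundaries([0, 1, 0, 2, 3, 2, 4]))  # [2, 5]
```

A single left-to-right pass suffices because a cluster that recurs later keeps the current scene open until its last occurrence.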

Cited by 16 publications (12 citation statements)
References: 15 publications
“…With the ability to learn non-obvious relationships between input features and output class, the adoption of machine learning techniques also naturally encouraged the use of more complex input features [18] including MFCCs [7] and perceptual linear prediction (PLP) coefficients [17]. Classification (and recall) techniques for such systems have, in recent years, most commonly involved support vector machines (SVM) [17], [19], Gaussian mixture models (GMMs) [20] or multilayer perceptrons (MLP) [21]. Research in machine hearing is often driven by the success of techniques used for ASR, hence a number of published techniques which make use of MFCC features [11], hidden Markov model toolkit (HTK) and associated back-end classifiers [15].…”
Section: Robust Sound Event Classification Using Deep Neural Network (mentioning; confidence: 99%)
“…However, manual processing of large collections of video for extracting structural semantics is practically infeasible, and the state-of-the-art techniques for performing this task automatically generate results that still deviate considerably from perfection (e.g. [9], [10]). Therefore, it is by no means straightforward to say that video structural semantics extracted automatically by current state-of-the-art techniques are useful in interactive retrieval, nor is it of course possible to quantify their potential contribution without detailed experimentation.…”
Section: Retrieval (mentioning; confidence: 99%)
“…The calculation of c(s) is followed by a normalization step. Similarly to audio events, discussed in [7], different visual concepts may have different frequencies of appearance in a given video (i.e. some concepts are rarer than others).…”
Section: Shot Representation and Similarity Evaluation (mentioning; confidence: 99%)
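The quoted passage does not give the normalization formula. One plausible scheme, offered purely as an assumption, divides each concept's confidence by that concept's mean confidence over all shots of the video, so that a high score for a rarely occurring concept is not swamped by concepts that appear everywhere. The function name `normalize_by_frequency` and the shots-by-concepts list layout are hypothetical.

```python
from typing import List


def normalize_by_frequency(conf: List[List[float]], eps: float = 1e-9) -> List[List[float]]:
    """Hypothetical normalization of per-shot concept confidences.

    conf[i][j] is the confidence of concept j in shot i. Each column is divided
    by that concept's mean confidence over the whole video, so concepts that
    are rare in the video get relatively larger values when they do appear.
    """
    n_shots = len(conf)
    n_concepts = len(conf[0]) if conf else 0
    # Mean confidence of each concept across all shots of the video.
    means = [sum(row[j] for row in conf) / n_shots for j in range(n_concepts)]
    # eps guards against concepts that never appear (mean of 0).
    return [[row[j] / (means[j] + eps) for j in range(n_concepts)] for row in conf]


if __name__ == "__main__":
    shots = [[0.2, 0.9], [0.0, 0.9]]  # concept 0 is rare, concept 1 is everywhere
    print(normalize_by_frequency(shots))
```

In this toy input the rare concept's 0.2 becomes roughly 2.0, while the ubiquitous concept's 0.9 becomes roughly 1.0.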
“…The commonly used Minkowski distance does not satisfy the above requirement, since it depends only on the difference of the confidence values. Instead, a variation of the Chi-test distance, which was shown to be useful when considering audio events [7], is employed in this work. Thus, the distance D of c(s_i) and c(s_k) is defined as:…”
Section: Shot Representation and Similarity Evaluation (mentioning; confidence: 99%)
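The exact "variation of the Chi-test distance" is elided in the excerpt. The sketch below uses the standard chi-square histogram distance instead (some definitions add a factor of 1/2), which is not necessarily the authors' variant, and contrasts it with a Minkowski distance to show the property the passage relies on: two pairs of confidence values with the same difference get the same Minkowski distance but different chi-square distances. The function names are illustrative.

```python
from typing import Sequence


def minkowski(a: Sequence[float], b: Sequence[float], p: float = 2.0) -> float:
    """Minkowski distance of order p; depends only on the differences a_j - b_j."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


def chi_square(a: Sequence[float], b: Sequence[float], eps: float = 1e-12) -> float:
    """Standard chi-square distance between two non-negative vectors.

    Each squared difference is divided by the sum of the two values, so the
    same gap counts more between small confidences than between large ones.
    eps avoids division by zero when both values are 0.
    """
    return sum((x - y) ** 2 / (x + y + eps) for x, y in zip(a, b))


if __name__ == "__main__":
    low = ([0.1], [0.2])   # two low confidence values
    high = ([0.8], [0.9])  # two high confidence values, same gap
    print(minkowski(*low), minkowski(*high))    # approximately equal
    print(chi_square(*low), chi_square(*high))  # low pair is farther apart
```

For the pairs (0.1, 0.2) and (0.8, 0.9) both Minkowski distances are about 0.1, whereas the chi-square distance of the low-confidence pair (about 0.033) is several times that of the high-confidence pair (about 0.006).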