ABSTRACT

We present a novel approach to scene classification using combined audio signal and video image features, and compare this methodology to scene classification results using each modality in isolation. Each modality is represented using summary features, namely Mel-frequency Cepstral Coefficients (audio) and the Scale Invariant Feature Transform (SIFT) (video), within a multi-resolution bag-of-features model. Uniquely, we extend the classical bag-of-words approach over both the audio and video feature spaces, introducing compressive sensing as a novel methodology for multi-modal fusion via audio-visual feature dimensionality reduction. We evaluate over a range of environments, showing performance that is both comparable to the state of the art (86% over ten scene classes) and invariant to a ten-fold dimensionality reduction within the audio-visual feature space using our compressive representation approach.
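As a minimal sketch of the fusion step outlined above, the Python below assumes precomputed per-modality bag-of-words histograms and models the compressive step as multiplication by a random Gaussian sensing matrix, the standard compressive-sensing measurement operator. The codebook sizes, the choice of an SVM classifier, and all function names here are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Hypothetical dimensions: one bag-of-words histogram per modality.
D_AUDIO, D_VIDEO = 1000, 1000      # per-modality codebook sizes (illustrative)
D = D_AUDIO + D_VIDEO              # joint audio-visual feature dimension
M = D // 10                        # ten-fold compressive reduction, as in the abstract

# Random Gaussian sensing matrix Phi, rows scaled toward near-isometry.
Phi = rng.standard_normal((M, D)) / np.sqrt(M)

def fuse_and_compress(audio_hist, video_hist):
    """Concatenate per-modality histograms and project: y = Phi @ x."""
    x = np.concatenate([audio_hist, video_hist])  # joint feature vector
    return Phi @ x                                # compressed measurement

# Illustrative end-to-end use on random placeholder data (not real features).
n_samples, n_classes = 200, 10
X_audio = rng.random((n_samples, D_AUDIO))
X_video = rng.random((n_samples, D_VIDEO))
y = rng.integers(0, n_classes, n_samples)

X = np.stack([fuse_and_compress(a, v) for a, v in zip(X_audio, X_video)])
clf = make_pipeline(StandardScaler(), SVC(kernel="rbf")).fit(X, y)
print(clf.score(X, y))
```

Because a random Gaussian projection approximately preserves pairwise distances between sparse histogram vectors, a classifier trained on the compressed measurements can plausibly match one trained on the full joint feature space, which is consistent with the invariance to ten-fold reduction reported in the abstract.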