On the Use of Visual Soft Semantics for Video Temporal Decomposition to Scenes

Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding From Multimedia

2014

In this work we deal with the problem of summarizing image collections, each of which corresponds to a single event. For this, we adopt a clustering-based approach, and we perform a comparative study of different clustering algorithms and image representations. As part of this study, we propose and examine the possibility of using trained concept detectors to represent each image with a vector of concept detector responses, which is then used as input to the clustering algorithms. A technique that indicates which concepts are the most informative for clustering is also introduced, allowing us to prune the employed concept detectors. Following the clustering, a summary of the collection (and thus also of the event) can be formed by selecting one or more images per cluster, according to different possible criteria. The combination of clustering and concept-based image representation is experimentally shown to result in clusters and summaries that match human expectations well.
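The pipeline the abstract outlines (concept-response vectors, clustering, one representative image per cluster) can be sketched roughly as follows. This is only an illustration under stated assumptions: plain k-means stands in for whichever clustering algorithms the study compares, "closest to the centroid" is just one possible selection criterion, and the names `kmeans` and `summarize` as well as the score vectors are hypothetical.

```python
import math
import random


def kmeans(vectors, k, iters=50, seed=0):
    """Plain Lloyd's k-means on lists of floats (an assumed stand-in
    for the clustering algorithms compared in the study)."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: each vector goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so centroids are too
        labels = new_labels
        # Update step: centroid becomes the mean of its members;
        # an empty cluster keeps its previous centroid.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


def summarize(concept_scores, k, seed=0):
    """Cluster images by concept-detector responses and return the index
    of the image closest to each cluster centroid (hypothetical criterion)."""
    centroids, labels = kmeans(concept_scores, k, seed=seed)
    summary = []
    for c in range(k):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            summary.append(
                min(members, key=lambda i: math.dist(concept_scores[i], centroids[c]))
            )
    return sorted(summary), labels


# Made-up responses of three concept detectors for six images.
scores = [
    [0.90, 0.10, 0.00], [0.80, 0.20, 0.10], [0.85, 0.15, 0.05],
    [0.10, 0.90, 0.80], [0.20, 0.80, 0.90], [0.15, 0.85, 0.85],
]
summary_indices, cluster_labels = summarize(scores, k=2)
```

Here `summary_indices` holds one image index per non-empty cluster. Pruning uninformative concepts, as the abstract mentions, would amount to dropping columns from `scores` before clustering; how those columns are chosen is not specified here.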

Section: Visual Concepts For Clustering (mentioning)

confidence: 99%

Concept-based Image Clustering and Summarization of Event-related Image Collections

Papagiannopoulou

Proceedings of the 1st ACM International Workshop on Human Centered Event Understanding From Multimedia

2014

“…However, manual processing of large collections of video for extracting structural semantics is practically infeasible, and the state-of-the-art techniques for performing this task automatically generate results that still deviate considerably from perfection (e.g. [9], [10]). Therefore, it is by no means straightforward to say that video structural semantics extracted automatically by current state-of-the-art techniques are useful in interactive retrieval, nor is it of course possible to quantify their potential contribution without detailed experimentation.…”

Section: Retrieval (mentioning)

confidence: 99%

“…[10], [12], further exploit higher-level information such as visual concept and audio event detection results in order to come to a more accurate extraction of the videos' structural semantics. Specifically, in [10] the possibility of exploiting, for the purpose of video segmentation to scenes, semantic information coming from the analysis of the visual modality, was examined.…”

Section: Retrieval (mentioning)

confidence: 99%

“…For the purpose of the study presented in this work, 6 different variations of the method of [10] were used (Table II). These differ in the information they use as input for extracting the video structural semantics (i.e., low-level visual features only for variations M1 to M3; low-level features and the responses of visual concept detectors ("visual soft semantics") for variations M4 to M6) and in the setting of their parameters (i.e., the shot similarity threshold for variations M1 to M3; the number of considered concept detectors and the strategy for their selection for variations M4 to M6).…”

Section: Retrieval (mentioning)

confidence: 99%

“…These differ in the information they use as input for extracting the video structural semantics (i.e., low-level visual features only for variations M1 to M3; low-level features and the responses of visual concept detectors ("visual soft semantics") for variations M4 to M6) and in the setting of their parameters (i.e., the shot similarity threshold for variations M1 to M3; the number of considered concept detectors and the strategy for their selection for variations M4 to M6). In general, the variations of the scene segmentation algorithm that take into account visual soft semantics were shown in [10] to produce scene segmentation results that are in better agreement with ground truth scene boundaries. The interested reader is referred to the aforementioned work for further details on these algorithms.…”

Section: Retrieval (mentioning)

confidence: 99%


Improving Interactive Video Retrieval by Exploiting Automatically-Extracted Video Structural Semantics

2011 IEEE Fifth International Conference on Semantic Computing

Sidiropoulos

Kompatsiaris

2011

Self Cite

Abstract: In this work the contribution of automatically-extracted (thus, imperfect) video structural semantics towards improving interactive video retrieval is examined. First, the automatic extraction of video structural semantics, i.e. the decomposition of the video into scenes that correspond to the different sub-stories or high-level events, is performed. Then, these are introduced to the interactive video retrieval paradigm. Finally, their potential contribution is experimentally evaluated. To this end, different members of a family of scene segmentation algorithms are applied to an extensive professional video collection coming from the TRECVID benchmarking activity; subsequently, a large number of user interactions with a retrieval system that exploits these structural semantics is simulated. The experimental results document the contribution of state-of-the-art automatically-extracted video structural semantics to efficient and effective interactive video retrieval.

Local Invariant Feature Tracks for High-Level Video Feature Extraction

Lecture Notes in Electrical Engineering

Dimou

Kompatsiaris

2012

Self Cite

This paper builds upon previous work on local interest point detection and description to propose the extraction and representation of novel Local Invariant Feature Tracks (LIFT). These features compactly capture not only the spatial attributes of 2D local regions, as in SIFT and related techniques, but also their long-term trajectories in time. This and other desirable properties of LIFT allow the generation of Bags-of-Spatiotemporal-Words models that facilitate capturing the dynamics of video content, which is necessary for detecting high-level video features that by definition have a strong temporal dimension. Preliminary experimental evaluation and comparison of the proposed approach reveals promising results.