Audio and video parts of an audiovisual document interact to produce an audiovisual, or multi-modal, perception. Yet, automatic analyses of these documents are usually based on separate audio and video annotations. With respect to the audiovisual content as a whole, these annotations may be incomplete or irrelevant. Moreover, the expanding possibilities for creating audiovisual documents lead us to consider different kinds of content, including videos filmed in uncontrolled conditions (i.e., field recordings) and scenes filmed from different points of view (multi-view). In this paper, we propose an original procedure for producing manual annotations in different contexts, including multi-modal and multi-view documents. This procedure, based on combining audio and video annotations, ensures consistency when audio or video is considered alone, and additionally provides audiovisual information at a richer level. Finally, such annotated data make different applications possible. In particular, we present an example application on a network of recordings in which our annotations enable multi-source retrieval using mono-modal or multi-modal queries.
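To illustrate how joint audio/video annotations could support multi-source retrieval, here is a minimal Python sketch; the segment layout, label names, and the `retrieve` function are assumptions made for illustration and do not reflect the authors' actual annotation format.

```python
# Hypothetical joint annotation layout: each recording holds time-stamped
# segments carrying both an audio label and a video label, and a query may
# constrain either modality (mono-modal) or both (multi-modal).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Segment:
    start: float          # seconds
    end: float            # seconds
    audio_label: str      # e.g. "car_engine", "speech"
    video_label: str      # e.g. "car", "person"

@dataclass
class Recording:
    name: str
    segments: List[Segment]

def retrieve(recordings: List[Recording],
             audio_query: Optional[str] = None,
             video_query: Optional[str] = None) -> List[str]:
    """Return names of recordings with at least one segment matching the query."""
    hits = []
    for rec in recordings:
        for seg in rec.segments:
            if audio_query is not None and seg.audio_label != audio_query:
                continue
            if video_query is not None and seg.video_label != video_query:
                continue
            hits.append(rec.name)
            break
    return hits
```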
In surveillance applications, humans and vehicles are the most commonly studied elements. Consequently, detecting and matching a person or a car that appears in several videos is a key problem. Many algorithms have been introduced, and a major related problem nowadays is to evaluate these algorithms precisely and to compare them against a common ground truth. In this paper, our goal is to introduce a new dataset for evaluating multi-view-based methods. This dataset aims at paving the way for multidisciplinary approaches and applications such as 4D-scene reconstruction, object identification/tracking, audio event detection, and multi-source metadata modeling and querying. Accordingly, we provide two sets of 25 synchronized videos with audio tracks, all depicting the same scene from multiple viewpoints, each set following a detailed scenario consisting of comings and goings of people and cars. Every video was annotated by regularly drawing bounding boxes on every moving object, with a flag indicating whether the object is fully visible or occluded, its category (human or vehicle), visual details (for example, clothing types or colors), and the timestamps of its appearances and disappearances. Audio events are also annotated with a category and timestamps.
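The following sketch shows one plausible record structure for the annotations described above; the field names and types are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical annotation records for one moving object and one audio event.
from dataclasses import dataclass, field
from typing import Dict, Tuple

@dataclass
class ObjectAnnotation:
    category: str                      # "human" or "vehicle"
    visual_details: Dict[str, str]     # e.g. {"clothes": "jacket", "color": "red"}
    appearance: float                  # timestamp (s) when the object enters the scene
    disappearance: float               # timestamp (s) when it leaves
    # Bounding boxes drawn at regular intervals:
    # frame index -> ((x, y, width, height), fully_visible_flag)
    boxes: Dict[int, Tuple[Tuple[int, int, int, int], bool]] = field(default_factory=dict)

@dataclass
class AudioEventAnnotation:
    category: str                      # e.g. "car_horn", "door_slam"
    start: float                       # timestamp (s)
    end: float                         # timestamp (s)
```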
Herein, the problem of vehicle re-identification using distance comparison of images in CNN latent spaces is addressed. First, the impact of the distance metric is studied by comparing the performance obtained with three metrics: the minimal Euclidean distance (MED), the minimal cosine distance (MCD), and the residue of the sparse coding reconstruction (RSCR). These metrics are applied to features extracted from five different CNN architectures, namely ResNet18, AlexNet, VGG16, InceptionV3, and DenseNet201. We use the dedicated vehicle re-identification dataset VeRi to fine-tune these CNNs and evaluate results. Overall, independently of the CNN used, MCD outperforms MED, the metric commonly used in the literature. These results are confirmed on other vehicle retrieval datasets. Second, the state-of-the-art image-to-track process (I2TP) is extended to a track-to-track process (T2TP): the three distance metrics are extended to measure distances between tracks, enabling T2TP. T2TP and I2TP are compared using the same CNN models. Results show that T2TP outperforms I2TP for MCD and RSCR. T2TP combining DenseNet201 with MCD-based metrics exhibits the best performance, outperforming the state-of-the-art I2TP-based models. Finally, the experiments highlight two main results: i) the choice of metric matters in vehicle re-identification, and ii) T2TP improves performance compared with I2TP, especially when coupled with MCD-based metrics.
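A minimal NumPy sketch of the MED and MCD metrics on CNN features is given below, together with one plausible track-to-track extension that takes the minimum over all image pairs; the exact track-level extension used in the paper may differ, and RSCR is omitted.

```python
import numpy as np

def med_i2t(query: np.ndarray, track: np.ndarray) -> float:
    """Minimal Euclidean distance between a query feature (d,) and a track (n, d)."""
    return float(np.min(np.linalg.norm(track - query, axis=1)))

def mcd_i2t(query: np.ndarray, track: np.ndarray) -> float:
    """Minimal cosine distance between a query feature (d,) and a track (n, d)."""
    q = query / np.linalg.norm(query)
    t = track / np.linalg.norm(track, axis=1, keepdims=True)
    return float(np.min(1.0 - t @ q))

def mcd_t2t(track_a: np.ndarray, track_b: np.ndarray) -> float:
    """Track-to-track MCD: minimum cosine distance over all image pairs (assumption)."""
    a = track_a / np.linalg.norm(track_a, axis=1, keepdims=True)
    b = track_b / np.linalg.norm(track_b, axis=1, keepdims=True)
    return float(np.min(1.0 - a @ b.T))

# Usage: features would be extracted from, e.g., a fine-tuned DenseNet201 backbone,
# and the gallery ranked by ascending distance to the query image or track.
```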
We present an approach for ranking a collection of videos with overlapping fields of view. The ranking reflects how well each video visualizes, i.e., with significant detail, a trajectory query drawn in one of the videos. The proposed approach decomposes each video into cells and estimates a correspondence map between cells from different videos using the linear correlation between their activity functions. The latter are obtained during a training phase by detecting objects in the videos and computing the coverage rate between the objects and the cells over time. The main idea is that two areas from two different videos that systematically contain objects at the same time are very likely to correspond to each other. We then use the correspondence between cells to reformulate the trajectory in the other videos. Finally, we rank the videos based on the visibility they offer. We show promising results by evaluating three aspects: the correspondence maps, the reformulation, and the ranking.
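The sketch below shows the correspondence-map step only, assuming the per-cell activity functions (coverage rates over time) are already available as arrays; matching each cell to its most correlated counterpart is one straightforward reading of the described idea, not necessarily the authors' exact procedure.

```python
import numpy as np

def correspondence_map(activity_a: np.ndarray, activity_b: np.ndarray) -> np.ndarray:
    """
    activity_a: (n_cells_a, T) coverage rates over time for video A's cells
    activity_b: (n_cells_b, T) coverage rates over time for video B's cells
    Returns, for each cell of A, the index of the most correlated cell of B.
    """
    # Center and normalize each activity function so the dot product equals
    # the Pearson (linear) correlation coefficient.
    a = activity_a - activity_a.mean(axis=1, keepdims=True)
    b = activity_b - activity_b.mean(axis=1, keepdims=True)
    a /= (np.linalg.norm(a, axis=1, keepdims=True) + 1e-12)
    b /= (np.linalg.norm(b, axis=1, keepdims=True) + 1e-12)
    corr = a @ b.T                      # (n_cells_a, n_cells_b) correlation matrix
    return corr.argmax(axis=1)          # best-matching cell in B for every cell in A
```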