Deep learning for content-based video retrieval in film and television production

Mühling, Markus; Korfhage, Nikolaus; Müller, Eric M.; Otto, Christian; Springstein, Matthias; Langelage, Thomas; Veith, Uli; Ewerth, Ralph; Freisleben, Bernd

doi:10.1007/s11042-017-4962-9

Cited by 29 publications

(15 citation statements)

References 29 publications


“…Remember that an AP score of 1 means that the k relevant videos for a query are returned exactly in the top k positions in the ranked video list, which is the ideal result. The ResNet-101 configuration achieves a similar number of queries (21,611) with AP higher than 90%, but also obtains a very low AP, less than 10%, for 451 queries (∼2% of the total), which is the main cause for the overall lower MAP. This is also evident in Fig.…”

Section: Experiments On the Test Set (mentioning)

confidence: 98%


A comparison of deep learning models for end-to-end face-based video retrieval in unconstrained videos

Ciaparrone

Chiariglione

Tagliaferri

2022

Neural Comput & Applic


Face-based video retrieval (FBVR) is the task of retrieving videos that contain the same face shown in the query image. In this article, we present the first end-to-end FBVR pipeline that is able to operate on large datasets of unconstrained, multi-shot, multi-person videos. We adapt an existing audiovisual recognition dataset to the task of FBVR and use it to evaluate our proposed pipeline. We compare a number of deep learning models for shot detection, face detection, and face feature extraction as part of our pipeline on a validation dataset made of more than 4000 videos. We obtain 97.25% mean average precision on an independent test set, composed of more than 1000 videos. The pipeline is able to extract features from videos at ∼7 times the real-time speed, and it is able to perform a query on thousands of videos in less than 0.5 s.



“…Mühling et al [21] presented a system able to perform video search based on textual descriptions or face images, in addition to face identification and clustering. The authors used Faster R-CNN [22] for face detection and another CNN [23] for feature extraction.…”

Section: Face Video Retrieval and Related Tasks (mentioning)

confidence: 99%

A comparison of deep learning models for end-to-end face-based video retrieval in unconstrained videos

Ciaparrone

Chiariglione

Tagliaferri

2022

Neural Comput & Applic


“…The approach has reached good performance in city. Mühling et al use deep learning method for video content retrieval in films and TV programs, and achieve high retrieval rate in those videos [74]. Hu et al construct a deep incremental slow feature analysis (D-IncSFA) network, to implement video anomaly detection, which relies on hand-crafted representations [75].…”

Section: Mixed-stage Video Object Detection (mentioning)

confidence: 99%

Visual Feature Learning on Video Object and Human Action Detection: A Systematic Review

Wang

Chen

et al. 2021

Micromachines


Video object and human action detection are applied in many fields, such as video surveillance, face recognition, etc. Video object detection includes object classification and object location within the frame. Human action recognition is the detection of human actions. Usually, video detection is more challenging than image detection, since video frames are often more blurry than images. Moreover, video detection often has other difficulties, such as video defocus, motion blur, part occlusion, etc. Nowadays, the video detection technology is able to implement real-time detection, or high-accurate detection of blurry video frames. In this paper, various video object and human action detection approaches are reviewed and discussed, many of them have performed state-of-the-art results. We mainly review and discuss the classic video detection methods with supervised learning. In addition, the frequently-used video object detection and human action recognition datasets are reviewed. Finally, a summarization of the video detection is represented, e.g., the video object and human action detection methods could be classified into frame-by-frame (frame-based) detection, extracting-key-frame detection and using-temporal-information detection; the methods of utilizing temporal information of adjacent video frames are mainly the optical flow method, Long Short-Term Memory and convolution among adjacent frames.


“…Other works have been studding visual semantics for large scale annotation, like [32,35,24]. Most recent works approach the problem with deep learning schemes which prove great performance [22,34]. However, CBVR methods require a lot of computational resources and are sometimes not feasible for large scale and real time applications as the one targeting in this work.…”

Section: Related Work (mentioning)

confidence: 99%

ViTS: Video Tagging System from Massive Web Multimedia Collections

Fernández

Varas

Espadaler

et al. 2017

2017 IEEE International Conference on Computer Vision Workshops (ICCVW)


The popularization of multimedia content on the Web has given rise to the need to automatically understand, index and retrieve it. In this paper we present ViTS, an automatic Video Tagging System which learns from videos, their web context and comments shared on social networks. ViTS analyses massive multimedia collections by Internet crawling, and maintains a knowledge base that updates in real time with no need of human supervision. As a result, each video is indexed with a rich set of labels and linked with other related contents. ViTS is an industrial product under exploitation with a vocabulary of over 2.5M concepts, capable of indexing more than 150k videos per month. We compare the quality and completeness of our tags with respect to the ones in the YouTube-8M dataset, and we show how ViTS enhances the semantic annotation of the videos with a larger number of labels (10.04 tags/video), with an accuracy of 80.87%. Extracted tags and video summaries are publicly available.
