LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers

Baraldi, Lorenzo; Douze, Matthijs; Cucchiara, Rita

doi:10.1109/cvpr.2018.00814

Cited by 49 publications

(75 citation statements)

References 31 publications


“…The distance between videos is determined by their Euclidean distance in the embedding space. In another work, Baraldi et al [2] introduced a temporal layer in a deep network that calculates the temporal alignment between videos. They trained the network minimizing the triplet loss that takes into account both localization accuracy and recognition rate.…”

Section: B Video Retrieval Methods (mentioning)

confidence: 99%
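
The two ingredients named in the statement above, Euclidean distance in an embedding space and a triplet loss, can be sketched as follows. This is a generic margin-based triplet loss for illustration only; it is not the LAMV loss, which according to the citing paper also accounts for localization accuracy. The function names and the `margin` parameter are assumptions introduced here.

```python
import numpy as np


def euclidean_distance(a, b):
    """Euclidean distance between two video embeddings (1-D vectors)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.linalg.norm(a - b))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Generic hinge-style triplet loss (illustrative, not the LAMV formulation).

    The loss is zero once the negative is at least `margin` farther from the
    anchor than the positive is.
    """
    d_pos = euclidean_distance(anchor, positive)
    d_neg = euclidean_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2-D embeddings: the positive is close to the anchor,
# the negative is far away, so the loss vanishes.
loss = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```

Swapping the positive and negative in the example makes the loss positive, which is the signal a network would be trained to reduce.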

FIVR: Fine-Grained Incident Video Retrieval

Kordopatis-Zilos, Papadopoulos, Patras et al. 2019

IEEE Trans. Multimedia


This paper introduces the problem of Fine-grained Incident Video Retrieval (FIVR). Given a query video, the objective is to retrieve all associated videos, considering several types of associations that range from duplicate videos to videos from the same incident. FIVR offers a single framework that contains several retrieval tasks as special cases. To address the benchmarking needs of all such tasks, we construct and present a large-scale annotated video dataset, which we call FIVR-200K, and it comprises 225,960 videos. To create the dataset, we devise a process for the collection of YouTube videos based on major news events from recent years crawled from Wikipedia and deploy a retrieval pipeline for the automatic selection of query videos based on their estimated suitability as benchmarks. We also devise a protocol for the annotation of the dataset with respect to the four types of video associations defined by FIVR. Finally, we report the results of an experimental study on the dataset comparing five state-of-the-art methods developed based on a variety of visual descriptors, highlighting the challenges of the current problem. Index Terms: incident video retrieval, near-duplicate videos, video retrieval, video dataset


“…Regarding the time correspondence, it is assumed to be a constant time shift, C(t_q) = t_q + β, where t_q and C(t_q) are the query frame and its corresponding one in the context video, respectively [7], [24]-[28], [34]-[36], or a linear relation, C(t_q) = α t_q + β, to consider the different frame rates of the input videos in [3], [8], [22] and [30]-[33]. On the other hand, many works adopted a free-form of the temporal correspondence [1], [2], [4], [9], [23], [29] and [32].…”

Section: B Video Alignment (mentioning)

confidence: 99%
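
The two parametric correspondence models in the statement above can be written out directly. This is a minimal sketch; the function names are hypothetical, and time is treated as a plain number (frame index or seconds) since the excerpt does not fix a unit.

```python
def constant_shift(t_q, beta):
    """Constant time shift: C(t_q) = t_q + beta."""
    return t_q + beta


def linear_correspondence(t_q, alpha, beta):
    """Linear relation: C(t_q) = alpha * t_q + beta.

    alpha models a frame-rate ratio between the two videos; with
    alpha = 1 this reduces to the constant-shift model.
    """
    return alpha * t_q + beta


# Hypothetical example: the context video runs at twice the query frame
# rate and the overlap starts 5 frames into the context video.
context_frame = linear_correspondence(10, alpha=2.0, beta=5.0)
```

A free-form correspondence, the third option mentioned in the excerpt, has no such closed form and would instead be stored as an explicit mapping from each query frame to a context frame.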

Video Alignment Using Bi-Directional Attention Flow in a Multi-Stage Learning Model

Abobeah, Shoukry, Katto 2020

IEEE Access


Recently, deep learning techniques have contributed to solving a multitude of computer vision tasks. In this paper, we propose a deep-learning approach for video alignment, which involves finding the best correspondences between two overlapping videos. We formulate the video alignment task as a variant of the well-known machine comprehension (MC) task in natural language processing. While MC answers a question about a given paragraph, our technique determines the most relevant frame sequence in the context video to the query video. This is done by representing the individual frames of the two videos by highly discriminative and compact descriptors. Next, the descriptors are fed into a multi-stage network that is able, with the help of the bidirectional attention flow mechanism, to represent the context video at various granularity levels besides estimating the query-aware context part. The proposed model was trained on 10k video-pairs collected from "YouTube". The obtained results show that our model outperforms all known state-of-the-art techniques by a considerable margin, confirming its efficacy. Index Terms: bi-directional attention, temporal alignment, video retrieval, video synchronization, video alignment.


“…Typical frame-level approaches [48,10,32,22,54] calculate the frame-by-frame similarity and then employ sequence alignment algorithms to compute similarity at the video level. Moreover, a lot of research effort has been invested in methods that exploit spatio-temporal features to represent video segments in order to facilitate video-level similarity computation [15,56,38,37,3]. Table 4.2 displays the performance of four approaches on the VCDB dataset.…”

Section: Frame-level Matching (mentioning)

confidence: 99%

“…relative time offset by considering all possible relative timestamps. Baraldi et al [3] built a deep learning layer component based on TMK and set up a training process to learn the feature transform coefficients in the Fourier domain. A triplet loss that takes into account both the video similarity score and the temporal alignment was used in order to train the proposed network.…”

Section: Research (mentioning)

confidence: 99%
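
The idea of scoring a relative time offset by considering all possible relative timestamps, combined with the frame-by-frame similarity mentioned in the preceding statement, can be illustrated with a brute-force sketch. This is not TMK or the LAMV layer, which the excerpt describes as operating on learned coefficients in the Fourier domain; it only shows the underlying quantity, a summed frame similarity per offset. The function name and the random descriptors are assumptions made for the example.

```python
import numpy as np


def offset_scores(query, ref):
    """Score every relative offset between two frame-descriptor sequences.

    query: array of shape (n_q, d), ref: array of shape (n_r, d).
    For offset delta, the score is the sum of dot products between
    query[t] and ref[t + delta] over all t where both frames exist.
    """
    sim = query @ ref.T  # frame-by-frame similarity matrix, shape (n_q, n_r)
    n_q, n_r = sim.shape
    offsets = np.arange(-(n_q - 1), n_r)
    # Each diagonal of the similarity matrix corresponds to one offset.
    scores = np.array([np.trace(sim, offset=int(d)) for d in offsets])
    return offsets, scores


# Hypothetical data: the query is a 7-frame excerpt of the reference
# starting at frame 5, so the true offset is 5 by construction.
rng = np.random.default_rng(0)
ref = rng.normal(size=(20, 8))
ref /= np.linalg.norm(ref, axis=1, keepdims=True)  # unit-length descriptors
query = ref[5:12]

offsets, scores = offset_scores(query, ref)
best_offset = int(offsets[np.argmax(scores)])
```

The brute-force version costs on the order of n_q × n_r × d operations; evaluating the same kind of score in the Fourier domain is what makes the kernelized approach described in the excerpt attractive.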

Finding Near-Duplicate Videos in Large-Scale Collections

Kordopatis-Zilos, Papadopoulos, Patras et al. 2019

Video Verification in the Fake News Era


This chapter discusses the problem of Near-Duplicate Video Retrieval (NDVR). The main objective of a typical NDVR approach is: given a query video, retrieve all near-duplicate videos in a video repository and rank them based on their similarity to the query. Several approaches have been introduced in the literature, which can be roughly classified in three categories based on the level of video matching, i.e. (i) video-level, (ii) frame-level and (iii) filter-and-refine matching. Two methods based on video-level matching are presented in this chapter. The first is an unsupervised scheme that relies on a modified Bag-of-Words (BoW) video representation. The second is a supervised method based on Deep Metric Learning (DML). For the development of both methods, features are extracted from the intermediate layers of Convolutional Neural Networks and leveraged as frame descriptors, since they offer a compact and informative image representation, and lead to increased system efficiency. Extensive evaluation has been conducted on publicly available benchmark datasets, and the presented methods are compared with state-of-the-art approaches, achieving the best results in all evaluation setups.
