2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2018.00814
LAMV: Learning to Align and Match Videos with Kernelized Temporal Layers

Abstract: This paper considers a learnable approach for comparing and aligning videos. Our architecture builds upon and revisits temporal match kernels within neural networks: we propose a new temporal layer that finds temporal alignments by maximizing the scores between two sequences of vectors, according to a time-sensitive similarity metric parametrized in the Fourier domain. We learn this layer with a temporal proposal strategy, in which we minimize a triplet loss that takes into account both the localization accura…
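The Fourier-domain parametrization described in the abstract can be made concrete with a small sketch. The code below is our illustration, not the authors' released implementation: each descriptor sequence is encoded with cosine/sine temporal modulations so that the similarity between two videos at any time shift evaluates as a trigonometric polynomial. The period T, the number of frequencies K, and the uniform frequency weights are illustrative assumptions (LAMV learns per-frequency coefficients).

```python
# Minimal sketch of temporal-match-kernel scoring, assuming a truncated
# Fourier expansion of a shift-invariant kernel with uniform frequency
# weights (fixed to 1 here; LAMV learns such coefficients).
import numpy as np

def encode(frames, T=64, K=8):
    """Embed an (n, d) descriptor sequence with cos/sin temporal modulations."""
    n, _ = frames.shape
    t = np.arange(n)
    freqs = 2 * np.pi * np.arange(1, K + 1)[:, None] / T          # (K, 1)
    return np.cos(freqs * t) @ frames, np.sin(freqs * t) @ frames  # (K, d) each

def score_at_shift(enc_q, enc_b, delta, T=64, K=8):
    """Video-level similarity at temporal shift `delta`, via angle addition."""
    cq, sq = enc_q
    cb, sb = enc_b
    freqs = 2 * np.pi * np.arange(1, K + 1) / T
    cos_ip = np.sum(cq * cb + sq * sb, axis=1)   # per-frequency cosine terms
    sin_ip = np.sum(sq * cb - cq * sb, axis=1)   # per-frequency sine terms
    return np.sum(np.cos(freqs * delta) * cos_ip - np.sin(freqs * delta) * sin_ip)

def best_alignment(enc_q, enc_b, T=64, K=8):
    """Exhaustively evaluate integer shifts and return the best (shift, score)."""
    shifts = np.arange(-T, T + 1)
    scores = np.array([score_at_shift(enc_q, enc_b, s, T, K) for s in shifts])
    i = int(np.argmax(scores))
    return int(shifts[i]), float(scores[i])
```

Because the score at any shift is a closed-form trigonometric polynomial, the per-frequency inner products are computed once and each candidate shift then costs only O(K), which is what makes an exhaustive alignment search cheap.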

Cited by 49 publications (75 citation statements) · References 31 publications
“…The distance between videos is determined by their Euclidean distance in the embedding space. In another work, Baraldi et al. [2] introduced a temporal layer in a deep network that calculates the temporal alignment between videos. They trained the network by minimizing a triplet loss that accounts for both localization accuracy and recognition rate.…”
Section: B. Video Retrieval Methods (mentioning)
confidence: 99%
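As a rough sketch of how a triplet objective can couple recognition with localization, the snippet below accepts a set of alignment proposals for the positive pair and only credits those whose shift lands near the ground truth; the margin, tolerance, and fallback rule are our illustrative choices, not the authors' exact formulation.

```python
# Hedged sketch of a localization-aware triplet loss (assumptions ours).
import numpy as np

def localized_triplet_loss(pos_scores, pos_shifts, gt_shift, neg_score,
                           margin=0.1, tol=2):
    """Hinge triplet over alignment proposals.

    A proposal counts as correct only if its shift lands within `tol` frames
    of the ground-truth offset (localization); the best correct proposal must
    then beat the negative pair's score by `margin` (recognition).
    """
    pos_scores = np.asarray(pos_scores, dtype=float)
    pos_shifts = np.asarray(pos_shifts)
    ok = np.abs(pos_shifts - gt_shift) <= tol
    # Fall back to the weakest proposal if none localizes correctly, so the
    # loss stays finite while still penalizing the miss.
    s_pos = pos_scores[ok].max() if ok.any() else pos_scores.min()
    return max(0.0, margin + neg_score - s_pos)
```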
“…Regarding the time correspondence, it is assumed to be a constant time shift, C(t_q) = t_q + β, where t_q and C(t_q) are the query frame and its corresponding frame in the context video, respectively [7], [24]-[28], [34]-[36], or a linear relation, C(t_q) = α·t_q + β, to account for the different frame rates of the input videos in [3], [8], [22] and [30]-[33]. On the other hand, many works adopted a free form of temporal correspondence [1], [2], [4], [9], [23], [29] and [32].…”
Section: B. Video Alignment (mentioning)
confidence: 99%
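For concreteness, here are the two parametric correspondence models quoted above, together with a brute-force estimator for the constant-shift case. This is our sketch; frame features are assumed to be L2-normalized rows, and all names are illustrative.

```python
# The two parametric time-correspondence models from the quote, plus a
# brute-force estimator for the constant shift (assumptions ours).
import numpy as np

def constant_shift(t_q, beta):
    """C(t_q) = t_q + beta: fixed offset, same frame rate."""
    return t_q + beta

def linear_map(t_q, alpha, beta):
    """C(t_q) = alpha * t_q + beta: also absorbs frame-rate differences."""
    return alpha * t_q + beta

def estimate_beta(query, context):
    """Pick the constant shift maximizing summed frame similarity.

    query: (nq, d), context: (nc, d); rows assumed L2-normalized so the
    elementwise-product sum equals a sum of cosine similarities.
    """
    nq, nc = len(query), len(context)
    best_beta, best_score = 0, -np.inf
    for beta in range(-nq + 1, nc):
        i = np.arange(max(0, -beta), min(nq, nc - beta))   # overlapping frames
        score = float(np.sum(query[i] * context[i + beta]))
        if score > best_score:
            best_beta, best_score = beta, score
    return best_beta
```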
“…Typical frame-level approaches [48,10,32,22,54] calculate the frame-by-frame similarity and then employ sequence-alignment algorithms to compute similarity at the video level. Moreover, considerable research effort has gone into methods that exploit spatio-temporal features to represent video segments in order to facilitate video-level similarity computation [15,56,38,37,3]. Table 4.2 displays the performance of four approaches on the VCDB dataset.…”
Section: Frame-level Matching (mentioning)
confidence: 99%
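A generic illustration of that frame-level recipe (our sketch, not any single cited method): compute a frame-by-frame similarity matrix, then accumulate the best temporally monotonic path with a simple dynamic program, in the spirit of Smith-Waterman-style alignment without gap penalties.

```python
# Illustrative frame-level matching: similarity matrix + DP alignment.
import numpy as np

def video_similarity(q_feats, b_feats):
    """q_feats: (nq, d), b_feats: (nb, d); rows assumed L2-normalized so
    the dot product is cosine similarity."""
    sim = q_feats @ b_feats.T            # frame-by-frame similarity matrix
    acc = np.zeros_like(sim)
    nq, nb = sim.shape
    for i in range(nq):
        for j in range(nb):
            prev = 0.0                   # starting a fresh path is allowed
            if i > 0:
                prev = max(prev, acc[i - 1, j])
            if j > 0:
                prev = max(prev, acc[i, j - 1])
            if i > 0 and j > 0:
                prev = max(prev, acc[i - 1, j - 1])
            # Extend the best earlier monotonic path through this cell.
            acc[i, j] = sim[i, j] + prev
    return float(acc.max())             # video-level similarity
```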
“…relative time offset by considering all possible relative timestamps. Baraldi et al. [3] built a deep-learning layer based on TMK and set up a training procedure to learn the feature-transform coefficients in the Fourier domain. A triplet loss that takes into account both the video similarity score and the temporal alignment was used to train the proposed network.…”
Section: Research (mentioning)
confidence: 99%
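To make "learning the feature transform coefficients in the Fourier domain" concrete, here is a hedged PyTorch sketch in which the per-frequency coefficients are trainable parameters; the class name, arguments, and the cos/sin encodings (as in the NumPy sketch above) are our assumptions, not the paper's API.

```python
# Hedged sketch of a TMK-style layer with learnable Fourier coefficients.
import torch
import torch.nn as nn

class KernelizedTemporalLayer(nn.Module):
    """Illustrative layer: one trainable weight per Fourier frequency."""

    def __init__(self, num_freqs=8, period=64.0):
        super().__init__()
        # Coefficients learned in the Fourier domain (initialized uniform).
        self.coeffs = nn.Parameter(torch.ones(num_freqs))
        freqs = 2 * torch.pi * torch.arange(1, num_freqs + 1,
                                            dtype=torch.float32) / period
        self.register_buffer("freqs", freqs)

    def forward(self, cq, sq, cb, sb, delta):
        """Score two encoded videos (cos/sin parts, each (K, d)) at shift `delta`."""
        cos_ip = (cq * cb + sq * sb).sum(dim=1)   # per-frequency cosine terms
        sin_ip = (sq * cb - cq * sb).sum(dim=1)   # per-frequency sine terms
        ang = self.freqs * delta
        return (self.coeffs * (torch.cos(ang) * cos_ip
                               - torch.sin(ang) * sin_ip)).sum()
```

Since the score is differentiable in the coefficients, a triplet objective like the one sketched earlier can be backpropagated through this layer to learn the Fourier-domain weights end to end.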