2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2015.7298994

Multi-task deep visual-semantic embedding for video thumbnail selection

Cited by 220 publications (130 citation statements)
References 19 publications
“…Of particular interest are applications of multimodal data using very deep networks. Recent research shows a deep semantic mapping between text and images [40] which motivates the use of knowledge extraction from, say, a text modality as relevant for the explanation of context in the image modality. Another direction for future work includes the parallel implementation of the proposed knowledge insertion and extraction algorithms, and its use in the iterative evaluation of very large networks, as part of a cycle of knowledge acquisition and revision.…”
Section: Discussion
confidence: 99%
“…The matching relations between query words and frame contents are captured by a shallow dual cross-media relevance model [21] adapted from the image annotation problem. Liu et al [23] employed a deep visual-semantic embedding model (VSEM) to measure the relevance between the query and video frames by embedding them into a latent semantic space. Hence, key frames in the video are ranked by their distances to the given query in the learned latent space, and the top-ranked frames are selected as the final video thumbnail.…”
Section: Related Work
confidence: 99%
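The selection step described in the excerpt above (embedding the query and the candidate key frames into a shared latent space and picking the frames closest to the query) can be sketched roughly as follows. The function name, the use of cosine similarity as the closeness measure, and the array shapes are illustrative assumptions, not the implementation from [23].

```python
# Minimal sketch of query-dependent frame ranking in a shared latent space.
# Assumes embeddings have already been produced by some text and frame encoders.
import numpy as np

def rank_frames_by_query(query_emb, frame_embs, top_k=1):
    """Rank candidate frames by cosine similarity to the query embedding.

    query_emb:  (d,)   embedding of the text query in the latent space.
    frame_embs: (n, d) embeddings of n candidate key frames.
    Returns the indices of the top_k most query-relevant frames.
    """
    q = query_emb / (np.linalg.norm(query_emb) + 1e-12)
    f = frame_embs / (np.linalg.norm(frame_embs, axis=1, keepdims=True) + 1e-12)
    scores = f @ q                       # cosine similarity of each frame to the query
    return np.argsort(-scores)[:top_k]   # closest frames in the latent space first

# Toy usage: 100 candidate frames in a 300-dimensional latent space.
rng = np.random.default_rng(0)
query = rng.normal(size=300)
frames = rng.normal(size=(100, 300))
print(rank_frames_by_query(query, frames, top_k=3))
```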
“…In order to improve relevancy, Gao et al proposed to utilize semantic information to select semantically representative frames as video thumbnails [1]. Liu et al developed a multi-task deep visual-semantic embedding model to automatically select query-dependent thumbnails based on both semantic and visual features [2]. Both methods heavily depend on using semantic information to guarantee the representativeness of the selection.…”
Section: Automated Thumbnail Selection
confidence: 99%
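As a rough illustration of how semantic relevance and visual representativeness might be combined when scoring candidate thumbnails, the sketch below weights a query-relevance term against a mean-similarity representativeness term. The linear weighting and the specific representativeness measure are assumptions made for illustration; they are not the methods of [1] or [2].

```python
# Hedged sketch: score frames by semantic relevance to the query plus
# visual representativeness of the frame within the video.
import numpy as np

def thumbnail_scores(query_emb, frame_embs, visual_feats, weight=0.7):
    """Combine query relevance (semantic) with representativeness (visual).

    query_emb:    (d,)   query embedding in the shared latent space.
    frame_embs:   (n, d) frame embeddings in the same space.
    visual_feats: (n, m) raw visual features of the n frames.
    weight:       trade-off between relevance and representativeness.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    relevance = l2norm(frame_embs) @ l2norm(query_emb)        # (n,) similarity to query
    vis_sim = l2norm(visual_feats) @ l2norm(visual_feats).T   # (n, n) frame-to-frame similarity
    representativeness = vis_sim.mean(axis=1)                 # (n,) mean similarity to the video
    return weight * relevance + (1 - weight) * representativeness

# Toy usage: pick the best-scoring of 50 candidate frames.
rng = np.random.default_rng(1)
scores = thumbnail_scores(rng.normal(size=300),
                          rng.normal(size=(50, 300)),
                          rng.normal(size=(50, 512)))
print(int(np.argmax(scores)))
```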