2019
DOI: 10.1609/aaai.v33i01.33019062

Multilevel Language and Vision Integration for Text-to-Clip Retrieval

Abstract: We address the problem of text-based activity retrieval in video. Given a sentence describing an activity, our task is to retrieve matching clips from an untrimmed video. To capture the inherent structures present in both text and video, we introduce a multilevel model that integrates vision and language features earlier and more tightly than prior work. First, we inject text features early on when generating clip proposals, to help eliminate unlikely clips and thus speed up processing and boost performance. S…
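The abstract's key idea, fusing the query text with visual features already at the proposal stage so that unlikely clips are discarded before the heavier matching step, can be sketched roughly as follows (a minimal PyTorch sketch; the module names, dimensions, and the element-wise fusion are illustrative assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class TextConditionedProposalScorer(nn.Module):
    """Hypothetical sketch: score candidate clip proposals conditioned on the query.

    Clip features (e.g. pooled from a 3D ConvNet) are fused with an encoded
    sentence *before* proposal ranking, so unlikely clips can be dropped early
    instead of being scored by a heavier downstream matcher.
    """

    def __init__(self, clip_dim=512, text_dim=300, hidden_dim=256):
        super().__init__()
        self.text_encoder = nn.LSTM(text_dim, hidden_dim, batch_first=True)
        self.clip_proj = nn.Linear(clip_dim, hidden_dim)
        self.scorer = nn.Linear(hidden_dim, 1)

    def forward(self, clip_feats, word_embs):
        # clip_feats: (num_proposals, clip_dim) pooled feature per candidate clip
        # word_embs:  (1, num_words, text_dim) embeddings of the query sentence
        _, (sent, _) = self.text_encoder(word_embs)
        sent = sent[-1]                                   # (1, hidden_dim) sentence vector
        clips = torch.relu(self.clip_proj(clip_feats))    # (num_proposals, hidden_dim)
        fused = clips * sent                              # early multiplicative fusion
        return self.scorer(fused).squeeze(-1)             # one relevance score per proposal


# Usage: keep only the top-k proposals for the later, finer-grained matching stage.
scorer = TextConditionedProposalScorer()
scores = scorer(torch.randn(100, 512), torch.randn(1, 12, 300))
topk = scores.topk(10).indices
```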

Cited by 301 publications (235 citation statements)
References 11 publications

“…Also, models that more carefully consider the effect of each word in a caption may benefit more from our improved features (e.g. [41,60]) […] these vision-language tasks. Visual Word2Vec performs comparably amongst results for generation tasks (i.e.…”
Section: Results (mentioning)
confidence: 99%
“…Early works study this task in constrained settings, including the fixed spatial prepositions [21,38], instruction videos [1,31,35] and ordering constraint [4,37]. Recently, unconstrained query-based moment retrieval has attracted a lot of attention [6,10,13,14,22,23,42]. These methods are mainly based on a sliding window framework, which first samples candidate moments and then ranks these moments.…”
Section: Query-based Moment Retrieval (mentioning)
confidence: 99%
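For readers unfamiliar with the sliding-window framework this excerpt describes, first sampling candidate moments and then ranking them against the query, a minimal sketch might look like this (window lengths, stride, and the cosine-similarity scorer are assumptions for illustration, not any cited method):

```python
import torch

def sample_candidate_moments(num_frames, window_lengths=(16, 32, 64), stride=8):
    """Enumerate (start, end) frame windows at several scales (sliding window)."""
    moments = []
    for length in window_lengths:
        for start in range(0, max(num_frames - length + 1, 1), stride):
            moments.append((start, min(start + length, num_frames)))
    return moments

def rank_moments(frame_feats, query_feat, moments):
    """Score each candidate by cosine similarity of its mean-pooled feature to the query."""
    scores = []
    for start, end in moments:
        clip_feat = frame_feats[start:end].mean(dim=0)
        scores.append(torch.cosine_similarity(clip_feat, query_feat, dim=0))
    order = torch.stack(scores).argsort(descending=True)
    return [moments[i] for i in order]

# Usage with random features: per-frame features (T, D) and an already-encoded query (D,).
ranked = rank_moments(torch.randn(200, 512), torch.randn(512), sample_candidate_moments(200))
```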
“…That is, each frame is not only relevant to adjacent frames, but also associated with distant ones. Existing approaches often apply RNN-based temporal modeling [6], or propose R-C3D networks to learn spatiotemporal representations from raw video streams [42]. Although these methods are able to absorb contextual information for each frame, they still fail to build direct interactions between distant frames.…”
Section: Introduction (mentioning)
confidence: 99%
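The RNN-based temporal modeling the excerpt refers to can be illustrated with a small sketch (a generic bidirectional GRU over per-frame features; dimensions and layer choices are assumptions, not the cited model). Context flows step by step along the sequence, which is why interactions between distant frames remain indirect:

```python
import torch
import torch.nn as nn

class RNNTemporalContext(nn.Module):
    """Minimal sketch of RNN-based temporal modeling over per-frame features."""

    def __init__(self, feat_dim=512, hidden_dim=256):
        super().__init__()
        # A bidirectional GRU propagates context along the frame sequence.
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim)
        context, _ = self.rnn(frame_feats)   # (batch, num_frames, 2 * hidden_dim)
        return context

# Usage: contextualize 128 frames of 512-d features.
out = RNNTemporalContext()(torch.randn(2, 128, 512))
```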
“…Text-Image Matching: Learning cross-modal embeddings has numerous applications [61,69] ranging from PINs using facial and voice information [37], to generative feature learning [15] and domain adaptation [63,65]. Nagrani et al [37] demonstrated that a joint representation can be learned from facial and voice information and introduced a curriculum learning strategy [3,45,46] to perform hard negative mining during training.…”
Section: Related Work (mentioning)
confidence: 99%
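The hard negative mining mentioned for the joint face-voice embedding can be illustrated by a standard in-batch hardest-negative triplet objective (a minimal sketch; the function name and batch-hardest selection are assumptions, and the curriculum schedule of Nagrani et al. is not reproduced here):

```python
import torch
import torch.nn.functional as F

def triplet_loss_hard_negatives(img_emb, txt_emb, margin=0.2):
    """Cross-modal triplet loss with in-batch hard negative mining.

    img_emb, txt_emb: (batch, dim) L2-normalized embeddings where row i of each
    modality forms a matching pair. For every positive pair we pick the hardest
    (highest-scoring) non-matching example in the batch as the negative.
    """
    scores = img_emb @ txt_emb.t()                    # (batch, batch) similarity matrix
    pos = scores.diag()                               # similarity of matching pairs
    mask = torch.eye(scores.size(0), dtype=torch.bool, device=scores.device)
    neg_for_img = scores.masked_fill(mask, -1e9).max(dim=1).values   # hardest text per image
    neg_for_txt = scores.masked_fill(mask, -1e9).max(dim=0).values   # hardest image per text
    loss_i = F.relu(margin + neg_for_img - pos).mean()
    loss_t = F.relu(margin + neg_for_txt - pos).mean()
    return loss_i + loss_t

# Usage with normalized random embeddings standing in for the two modalities.
loss = triplet_loss_hard_negatives(F.normalize(torch.randn(32, 256), dim=1),
                                   F.normalize(torch.randn(32, 256), dim=1))
```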