Multilevel Language and Vision Integration for Text-to-Clip Retrieval

Xu, Huijuan; He, Kun; Plummer, Bryan A.; Sigal, Leonid; Sclaroff, Stan; Saenko, Kate

doi:10.1609/aaai.v33i01.33019062

Cited by 301 publications

(235 citation statements)

References 11 publications

Supporting

Mentioning

228

Contrasting

Unclassified

“…Also, models that more carefully consider the effect of each word in a caption may benefit more from our improved features (e.g. [41,60] these vision-language tasks. Visual Word2Vec performs comparably amongst results for generation tasks (i.e.…”

Section: Results (mentioning)

confidence: 99%

Language Features Matter: Effective Language Representations for Vision-Language Tasks

Burns, Tan, Saenko, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Self Cite

Shouldn't language and vision features be treated equally in vision-language (VL) tasks? Many VL approaches treat the language component as an afterthought, using simple language models that are either built upon fixed word embeddings trained on text-only data or are learned from scratch. We believe that language features deserve more attention, and conduct experiments which compare different word embeddings, language models, and embedding augmentation steps on five common VL tasks: image-sentence retrieval, image captioning, visual question answering, phrase grounding, and text-to-clip retrieval. Our experiments provide some striking results; an average embedding language model outperforms an LSTM on retrieval-style tasks; state-of-the-art representations such as BERT perform relatively poorly on vision-language tasks. From this comprehensive set of experiments we propose a set of best practices for incorporating the language component of VL tasks. To further elevate language features, we also show that knowledge in vision-language problems can be transferred across tasks to gain performance with multi-task training. This multi-task training is applied to a new Graph Oriented Vision-Language Embedding (GrOVLE), which we adapt from Word2Vec using WordNet and an original visual-language graph built from Visual Genome, providing a ready-to-use vision-language embedding:

“…Early works study this task in constrained settings, including the fixed spatial prepositions [21,38], instruction videos [1,31,35] and ordering constraint [4,37]. Recently, unconstrained query-based moment retrieval has attracted a lot of attention [6,10,13,14,22,23,42]. These methods are mainly based on a sliding window framework, which first samples candidate moments and then ranks these moments.…”

Section: Query-based Moment Retrieval (mentioning)

confidence: 99%

“…That is, each frame is not only relevant to adjacent frames, but also associated with distant ones. Existing approaches often apply RNN-based temporal modeling [6], or propose R-C3D networks to learn spatiotemporal representations from raw video streams [42]. Although these methods are able to absorb contextual information for each frame, they still fail to build direct interactions between distant frames.…”

Section: Introduction (mentioning)

confidence: 99%

“…Early approaches [10,13,14] ignore this factor and only simply combine the query and moment features for correlation estimations. Although recent methods [6,22,23,42] have developed a cross-modal interaction by widely-used attention mechanism, they still remain in the rough one-stage interaction, for example, highlighting the crucial context information of moments by the guidance of queries [22]. Different from previous works, we adopt a multi-stage cross-modal interaction method to further exploit the potential relation of video and query contents.…”

Section: Introduction (mentioning)

confidence: 99%

Cross-Modal Interaction Networks for Query-Based Moment Retrieval in Videos

Zhang, Lin, Zhao, et al. 2019

Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval

Query-based moment retrieval aims to localize the most relevant moment in an untrimmed video according to the given natural language query. Existing works often only focus on one aspect of this emerging task, such as the query representation learning, video context modeling or multi-modal fusion, thus fail to develop a comprehensive system for further performance improvement. In this paper, we introduce a novel Cross-Modal Interaction Network (CMIN) to consider multiple crucial factors for this challenging task, including (1) the syntactic structure of natural language queries; (2) long-range semantic dependencies in video context and (3) the sufficient cross-modal interaction. Specifically, we devise a syntactic GCN to leverage the syntactic structure of queries for fine-grained representation learning, propose a multi-head self-attention to capture long-range semantic dependencies from video context, and next employ a multi-stage cross-modal interaction to explore the potential relations of video and query contents. The extensive experiments demonstrate the effectiveness of our proposed method. Our core code has been released at https://github.com/ikuinen/CMIN. CCS Concepts: Information systems → Novelty in information retrieval. Keywords: Query-based moment retrieval; syntactic GCN; multi-head self-attention; multi-stage cross-modal interaction.

“…Text-Image Matching: Learning cross-modal embeddings has numerous applications [61,69] ranging from PINs using facial and voice information [37], to generative feature learning [15] and domain adaptation [63,65]. Nagrani et al [37] demonstrated that a joint representation can be learned from facial and voice information and introduced a curriculum learning strategy [3,45,46] to perform hard negative mining during training.…”

Section: Related Work (mentioning)

confidence: 99%

Adversarial Representation Learning for Text-to-Image Matching

Sarafianos, Kakadiaris 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

For many computer vision applications such as image captioning, visual question answering, and person search, learning discriminative feature representations at both image and text level is an essential yet challenging problem. Its challenges originate from the large word variance in the text domain as well as the difficulty of accurately measuring the distance between the features of the two modalities. Most prior work focuses on the latter challenge, by introducing loss functions that help the network learn better feature representations but fail to account for the complexity of the textual input. With that in mind, we introduce TIMAM: a Text-Image Modality Adversarial Matching approach that learns modality-invariant feature representations using adversarial and cross-modal matching objectives. In addition, we demonstrate that BERT, a publicly available language model that extracts word embeddings, can successfully be applied in the text-to-image matching domain. The proposed approach achieves state-of-the-art cross-modal matching performance on four widely-used publicly-available datasets resulting in absolute improvements ranging from 2% to 5% in terms of rank-1 accuracy.
