2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
DOI: 10.1109/cvpr.2019.01276
Multi-Level Multimodal Common Semantic Space for Image-Phrase Grounding

Abstract: We address the problem of phrase grounding by learning a multi-level common semantic space shared by the textual and visual modalities. We exploit multiple levels of feature maps of a Deep Convolutional Neural Network, as well as contextualized word and sentence embeddings extracted from a character-based language model. Following dedicated non-linear mappings for visual features at each level, word, and sentence embeddings, we obtain multiple instantiations of our common semantic space in which comparisons be…

Cited by 65 publications (66 citation statements) | References 51 publications
“…For every image, we select the top 30 RoIs based on Faster-RCNN's class detection score (after non-maximal suppression and thresholding). We then use RoIAlign [17] to extract features (d_v = 2048-d) for each of these RoIs using a ResNet-152 model pre-trained on ImageNet [18].…”
Section: Implementation Details
confidence: 99%
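The implementation detail quoted above amounts to a standard detect-then-pool pipeline: keep the top-scoring Faster-RCNN proposals, then pool 2048-d ResNet-152 features for each RoI with RoIAlign. The sketch below illustrates that pipeline, assuming torchvision as the library; the specific detector checkpoint, score threshold, pooling size, and the omitted ImageNet normalization are illustrative assumptions, not the citing paper's exact settings.

```python
# Hypothetical sketch, assuming torchvision: top-30 Faster-RCNN proposals per
# image, then 2048-d RoIAlign features from an ImageNet-pretrained ResNet-152.
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True).eval()
backbone = torchvision.models.resnet152(pretrained=True).eval()
# Drop avgpool and fc; keep the conv feature map (2048 channels, stride 32).
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-2])

@torch.no_grad()
def extract_roi_features(image, top_k=30, score_thresh=0.05):
    """image: float tensor (3, H, W) in [0, 1]. Returns a (top_k, 2048) tensor."""
    det = detector([image])[0]                       # NMS and scoring done internally
    keep = det["scores"] >= score_thresh             # thresholding
    boxes, scores = det["boxes"][keep], det["scores"][keep]
    boxes = boxes[scores.argsort(descending=True)[:top_k]]   # top-k by detection score
    fmap = feature_extractor(image.unsqueeze(0))     # (1, 2048, H/32, W/32)
    pooled = roi_align(fmap, [boxes], output_size=(7, 7), spatial_scale=1.0 / 32)
    return pooled.mean(dim=(2, 3))                   # average-pool to 2048-d per RoI
```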
“…For phrase localization, we outperform the previous state-of-the-art [13], which uses a variant of the Global method with a novel spatial pooling step, by 4.9% based on the PointIt% metric on VG. On Flickr30k Entities, we outperform prior state-of-the-art [2] with much simpler encoders (ResNet+bi-GRU vs. PNASNet+ELMo). For the caption-to-image retrieval task, we also achieve state-of-the-art performance (R@1 of 49.7 vs. 48.6 by [25]) on the Flickr30k dataset and get competitive results relative to state-of-the-art (R@1 of 56.6 vs. 58.8 by [25]) on the COCO dataset for the downstream C2I task.…”
Section: Comparison With State-of-the-art
confidence: 99%
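The R@1 figures quoted above refer to Recall@1 for caption-to-image retrieval: the fraction of caption queries whose ground-truth image is ranked first. A minimal sketch of that metric, assuming one paired image per caption and a precomputed caption-image similarity matrix (a hypothetical setup, not the citing paper's evaluation code):

```python
# Recall@k from a caption-image similarity matrix sim[i, j], where caption i
# is paired with image i (illustrative assumption).
import numpy as np

def recall_at_k(sim: np.ndarray, k: int = 1) -> float:
    # Rank all images for each caption (row), highest similarity first.
    ranking = np.argsort(-sim, axis=1)
    # A hit means the paired image (index == row index) is in the top k.
    hits = (ranking[:, :k] == np.arange(sim.shape[0])[:, None]).any(axis=1)
    return float(hits.mean())

# Example: 3 captions x 3 images; caption 1 retrieves the wrong image, so R@1 = 2/3.
sim = np.array([[0.9, 0.1, 0.2],
                [0.3, 0.4, 0.8],
                [0.2, 0.1, 0.7]])
print(recall_at_k(sim, k=1))  # 0.666...
```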
“…Inspired by the work of [2] and [12], we take the approach of using multiple autoencoders [53] that each learn a common feature space for multiple modalities. Assuming that the common feature space has a dimensionality of , the mapping function in the encoder for each autoencoder ( ) outputs a vector with dimensionality of .…”
Section: Mapping Inputs To a Common Feature Space
confidence: 99%
“…[Figure 2: an illustration of the mapping module with two modalities, showing encoders E(1), E(2), decoders D(1), D(2), and reconstruction ĉ.] Figure 2 provides an illustration with an example of two modalities mapped into a common feature space and then reconstructed, based on two autoencoders.…”
Section: ( )
confidence: 99%
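The two quotes above describe one autoencoder per modality: each encoder maps its input into a shared feature space of common dimensionality, and each decoder reconstructs the original input. The sketch below illustrates that idea in PyTorch; the layer sizes, shared dimensionality, and loss terms are assumptions for illustration, not the citing paper's architecture.

```python
# Hedged sketch: one autoencoder per modality, encoders E(m) map into a shared
# d-dimensional space, decoders D(m) reconstruct the original input.
import torch
import torch.nn as nn

class ModalityAutoencoder(nn.Module):
    def __init__(self, input_dim: int, common_dim: int):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(input_dim, common_dim), nn.ReLU())
        self.decoder = nn.Linear(common_dim, input_dim)

    def forward(self, x):
        z = self.encoder(x)          # mapping into the common feature space
        x_hat = self.decoder(z)      # reconstruction of the original modality
        return z, x_hat

common_dim = 256                                  # hypothetical shared dimensionality
visual_ae = ModalityAutoencoder(input_dim=2048, common_dim=common_dim)
text_ae = ModalityAutoencoder(input_dim=300, common_dim=common_dim)

v = torch.randn(8, 2048)                          # e.g. RoI features
t = torch.randn(8, 300)                           # e.g. word embeddings
(z_v, v_hat), (z_t, t_hat) = visual_ae(v), text_ae(t)

# Reconstruction keeps each embedding faithful to its modality, while a
# similarity term pulls paired samples together in the common space.
loss = nn.functional.mse_loss(v_hat, v) + nn.functional.mse_loss(t_hat, t) \
     - nn.functional.cosine_similarity(z_v, z_t).mean()
```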
“…Phrase grounding. Most existing works, if not all, can be categorized into two types: one is attention-based [2,4] and the other is embedding-based [10,29]. The former type treats the input text query as a key to generate a pixel-level or coarser attention map and uses this map to further generate bounding boxes.…”
Section: Related Work
confidence: 99%
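For the attention-based family described in the quote, the text query scores the spatial locations of an image feature map, and the resulting attention map is then turned into a box. The sketch below is a hypothetical illustration of that idea only, not any specific cited method; the dot-product scoring, thresholding rule, and dimensions are assumptions.

```python
# Illustrative sketch: a text query embedding attends over spatial image
# features, and the attention map is thresholded into a rough bounding box.
import torch

def attention_map(img_feats: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
    """img_feats: (C, H, W) image features; query: (C,) text query embedding."""
    C, H, W = img_feats.shape
    scores = torch.einsum("chw,c->hw", img_feats, query) / C ** 0.5
    return torch.softmax(scores.flatten(), dim=0).view(H, W)

def box_from_map(attn: torch.Tensor, keep: float = 0.5):
    """Return (x1, y1, x2, y2) covering the top `keep` fraction of locations by attention."""
    thresh = torch.quantile(attn.flatten(), 1.0 - keep)
    ys, xs = torch.nonzero(attn >= thresh, as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()

attn = attention_map(torch.randn(256, 14, 14), torch.randn(256))
print(box_from_map(attn))   # box in feature-map coordinates
```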