2018
DOI: 10.48550/arxiv.1805.00545
Preprint

Weakly Supervised Attention Learning for Textual Phrases Grounding

Abstract: Grounding textual phrases in visual content is a meaningful yet challenging problem with various potential applications, such as image-text inference or text-driven multimedia interaction. Most existing methods adopt a supervised learning mechanism that requires pixel-level ground truth during training. However, fine-grained ground-truth annotation is quite time-consuming and severely narrows the scope for more general applications. In this extended abstract, we explore methods to loca…
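The abstract describes learning to ground phrases via attention without pixel-level labels. A minimal sketch of the core attention step that such weakly supervised grounding methods typically use — scoring image-region features against a phrase embedding and normalizing with a softmax — might look like the following; the function names and toy feature vectors are hypothetical illustrations, not taken from the paper:

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(phrase_vec, region_feats):
    """Attend over image regions conditioned on a phrase embedding.

    Under weak supervision these attention weights are the only grounding
    signal: no box- or pixel-level annotation is used; the model is trained
    end-to-end from image-level (image-sentence) supervision instead.
    """
    scores = region_feats @ phrase_vec   # (R,) similarity per region
    weights = softmax(scores)            # attention distribution over regions
    grounded = weights @ region_feats    # phrase-conditioned region summary
    return weights, grounded

# Toy example: 3 regions with 4-dim features; region 1 matches the phrase.
regions = np.array([[0.1, 0.0, 0.2, 0.1],
                    [0.9, 0.8, 0.7, 0.9],
                    [0.0, 0.2, 0.1, 0.0]])
phrase = np.array([1.0, 1.0, 1.0, 1.0])
w, g = attend(phrase, regions)
print(w.argmax())  # index of the region the phrase grounds to
```

At inference time, the argmax (or a threshold) over the attention weights localizes the phrase, which is how attention maps stand in for the missing pixel-level ground truth.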

Cited by 5 publications (2 citation statements)
References 19 publications
“…Association Learning across Vision and Language is the core tie of a wide range of tasks across the vision and language domains, e.g., textual grounding [38], referring expression comprehension [31], or object retrieval using language [16]. Recent works focus on leveraging image-level annotations (as weak supervision) [8,9] or unsupervised methods [64] to learn the association across language descriptions and objects. Proceeding from this, works have arisen that use uncurated captions to learn temporal associations across video segments and texts [27,51].…”
Section: Related Work
confidence: 99%
“…State-of-the-art textual grounding methods [69,34,60,58,66,47,16] are based on deep neural networks and rely on large-scale training data with manual annotations for object bounding boxes and the relationships between phrases and figures/objects. This setup largely limits their broad application, as such strong supervision is expensive to obtain, and they also lack interpretability and resilience to counterfactual cases that do not appear in training.…”
Section: Related Work
confidence: 99%