2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
DOI: 10.1109/CVPR.2018.00142
MAttNet: Modular Attention Network for Referring Expression Comprehension

Abstract: In this paper, we address referring expression comprehension: localizing an image region described by a natural language expression. While most recent work treats expressions as a single unit, we propose to decompose them into three modular components related to subject appearance, location, and relationship to other objects. This allows us to flexibly adapt to expressions containing different types of information in an end-to-end framework. In our model, which we call the Modular Attention Network (MAttNet), …
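As a rough illustration of the decomposition the abstract describes, the sketch below combines per-module matching scores with expression-dependent weights predicted from the expression embedding. It is a minimal PyTorch sketch, not MAttNet's actual implementation; the module internals are stubbed out and all names and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class ModularScore(nn.Module):
    """Combine per-module matching scores with expression-dependent weights.

    Simplified sketch of a modular matching function: a small language
    network predicts one weight per module (subject, location, relationship),
    and the final score is the weighted sum of the module scores.
    """
    def __init__(self, expr_dim: int, num_modules: int = 3):
        super().__init__()
        # Predict one weight per module from the pooled expression embedding.
        self.weight_fc = nn.Linear(expr_dim, num_modules)

    def forward(self, expr_emb, subj_score, loc_score, rel_score):
        # expr_emb: (B, expr_dim); each *_score: (B,) per region candidate.
        w = torch.softmax(self.weight_fc(expr_emb), dim=-1)           # (B, 3)
        scores = torch.stack([subj_score, loc_score, rel_score], dim=-1)
        return (w * scores).sum(dim=-1)                               # (B,)
```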

Cited by 733 publications (780 citation statements). References 33 publications.
“…The other is the joint vision-language embedding framework to model P(q, r). During training, the supervision is object proposal and referring expression pairs (r_i, q_i) [3,20,25,34,38,46]. The relationship between the target entity and a context entity is often used to assist grounding the target in supervised REG methods [15,19,23,30,32,49].…”
Section: Related Work (mentioning)
confidence: 99%
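For context, the joint vision-language embedding framework this excerpt mentions can be sketched as projecting region and expression features into a shared space and scoring candidate regions with a softmax over their similarities. This is a hedged sketch under assumed shapes and names, not the implementation of any cited method.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    """Score region-expression pairs in a shared embedding space.

    Minimal sketch: region features and the expression embedding are
    projected into a common space, and the match distribution over the
    N candidate proposals is a softmax over cosine similarities.
    Dimensions and names are assumptions.
    """
    def __init__(self, vis_dim: int, lang_dim: int, joint_dim: int = 512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, joint_dim)
        self.lang_proj = nn.Linear(lang_dim, joint_dim)

    def forward(self, region_feats, expr_emb):
        # region_feats: (N, vis_dim) for N proposals; expr_emb: (lang_dim,)
        v = F.normalize(self.vis_proj(region_feats), dim=-1)
        q = F.normalize(self.lang_proj(expr_emb), dim=-1)
        sims = v @ q                        # (N,) cosine similarities
        return F.log_softmax(sims, dim=0)   # log-probability per proposal
```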
“…Thus we also add an attribute classification branch in our model. The attribute labels are extracted with an external language parser [17], following [46]. The subject feature r_i^s of each proposal is used for attribute classification.…”
Section: Attribute Classification Loss (mentioning)
confidence: 99%
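The attribute classification branch this excerpt describes can be sketched as a multi-label classifier over an attribute vocabulary trained with binary cross-entropy, since a region may carry several attributes at once. The names, feature dimension, and vocabulary size below are assumptions.

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    """Multi-label attribute classification branch (sketch).

    The subject feature of each proposal is fed to a linear classifier
    over an attribute vocabulary (labels mined by an external parser),
    trained with binary cross-entropy so multiple attributes can be
    active simultaneously.
    """
    def __init__(self, feat_dim: int, num_attrs: int = 50):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_attrs)
        self.loss_fn = nn.BCEWithLogitsLoss()

    def forward(self, subj_feat, attr_targets):
        # subj_feat: (B, feat_dim); attr_targets: (B, num_attrs) in {0, 1}
        logits = self.classifier(subj_feat)
        return self.loss_fn(logits, attr_targets.float())
```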
“…[7] decomposes the expression into subject-relationship-object triplets and aligns the textual representations with image regions using a localization module or a relationship module; however, referring expressions have much richer forms than this fixed subject-relationship-object template. MAttNet [29] decomposes the expression into three phrases corresponding to the subject, location, and relationship modules, respectively; however, it cannot perform multi-step reasoning. Another work [32] enables reasoning as a step-wise attention process following the step-wise representation of the expression; however, it treats the expression as a sequence of words, ignoring the linguistic structure of the expression.…”
Section: Interpretable Reasoning (mentioning)
confidence: 99%
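The step-wise attention process the excerpt attributes to [32] can be sketched as repeated attention over the word embeddings of the expression, with a recurrent state carrying the reasoning forward; note this treats the expression as a flat word sequence, exactly the limitation the excerpt points out. This is an illustrative guess at the general pattern, not that paper's model; all names are placeholders.

```python
import torch
import torch.nn as nn

class StepwiseAttention(nn.Module):
    """Multi-step attention over the words of an expression (sketch)."""
    def __init__(self, dim: int, steps: int = 3):
        super().__init__()
        self.steps = steps
        self.cell = nn.GRUCell(dim, dim)

    def forward(self, word_embs, init_state):
        # word_embs: (T, dim) per-word embeddings; init_state: (dim,)
        h = init_state
        for _ in range(self.steps):
            attn = torch.softmax(word_embs @ h, dim=0)   # (T,) word weights
            context = attn @ word_embs                   # (dim,) attended words
            h = self.cell(context.unsqueeze(0), h.unsqueeze(0)).squeeze(0)
        return h  # final reasoning state after `steps` attention hops
```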
“…Referring expressions are complex and include rich dependencies and nested linguistic structures, which further guide the visual reasoning process. In theory, natural-language parsers can recover the grammatical relations among the words of an expression, but existing parsers are not practical for referring expression comprehension because the language is highly unrestricted [29]. Each complex expression is defined by its constituent expressions and the rules used to combine them.…”
Section: Language-guided Visual Reasoning Process (mentioning)
confidence: 99%
“…The predicted region corresponds to the caption ranked most similar to the referring expression. Other works based on an RPN (Hu et al., 2017) or Faster R-CNN (Yu et al., 2018) integrate attention mechanisms to decompose the language expression into multiple sub-parts, but they use modules tailored to specific sub-tasks, making them less suited to our object-referral task. Karpathy et al. (2014) interpret the inner product between region proposals and sentence fragments as a similarity score, allowing them to be matched in a bidirectional manner.…”
Section: Related Work (mentioning)
confidence: 99%
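The inner-product scoring attributed to Karpathy et al. (2014) in this excerpt can be sketched as a similarity matrix between proposal embeddings and sentence-fragment embeddings, which supports matching in both directions. Shapes and names are assumptions.

```python
import torch

def fragment_similarity(region_feats, fragment_embs):
    """Inner-product similarity between proposals and sentence fragments.

    Sketch of bidirectional matching: every region proposal embedding is
    compared with every fragment embedding via an inner product, and the
    resulting matrix is read row-wise or column-wise to match in either
    direction (region -> fragment and fragment -> region).
    """
    # region_feats: (N, d) proposals; fragment_embs: (M, d) fragments.
    sims = region_feats @ fragment_embs.T        # (N, M) similarity scores
    best_region_per_frag = sims.argmax(dim=0)    # fragment -> region index
    best_frag_per_region = sims.argmax(dim=1)    # region -> fragment index
    return sims, best_region_per_frag, best_frag_per_region
```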