2018
DOI: 10.1609/aaai.v32i1.12343
Using Syntax to Ground Referring Expressions in Natural Images

Abstract: We introduce GroundNet, a neural network for referring expression recognition---the task of localizing (or grounding) in an image the object referred to by a natural language expression. Our approach to this task is the first to rely on a syntactic analysis of the input referring expression in order to inform the structure of the computation graph. Given a parse tree for an input expression, we explicitly map the syntactic constituents and relationships present in the tree to a composed graph of neural modules…
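The compositional idea in the abstract — each constituent of the parse tree becomes a neural module, and modules are composed following the tree — can be illustrated with a minimal sketch. Everything here is hypothetical: the toy `score_word` leaf module, the set-based region representation, and the multiplicative combination are illustrative stand-ins, not GroundNet's actual learned modules.

```python
def score_word(word, regions):
    # Hypothetical leaf module: score each image region by whether it
    # carries the word as an attribute (a stand-in for a learned scorer).
    return [1.0 if word in feats else 0.1 for feats in regions]

def ground(node, regions):
    """Recursively ground a parse-tree node: each syntactic constituent
    becomes a module, and a node's region scores are combined with the
    scores produced by its children, mirroring the tree structure."""
    score = score_word(node["word"], regions)
    for child in node.get("children", []):
        child_score = ground(child, regions)
        # Combine evidence multiplicatively (illustrative choice only).
        score = [s * c for s, c in zip(score, child_score)]
    return score

# Toy scene: three regions described by attribute sets.
regions = [{"dog", "table"}, {"dog", "floor"}, {"cat", "table"}]
# Toy parse for "dog on table": head noun with one dependent.
tree = {"word": "dog", "children": [{"word": "table"}]}
scores = ground(tree, regions)  # region 0 satisfies both constituents
```

The point of the sketch is the control flow: the computation graph is not fixed in advance but assembled per input from the expression's syntax.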

Cited by 28 publications (8 citation statements)
References 30 publications
“…Our evaluation metrics are slot tagging F1 score and intent accuracy. Incorporating dependency parse information is known to improve the compositional generalization of neural networks [27,28]. We test an advanced baseline model (BERT SLU + parse tree) which modifies the original attention scores in the final transformer layer with a weight inversely dependent on token distance on the dependency tree.…”
Section: Settings and Baselines
confidence: 99%
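The reweighting scheme quoted above — scaling final-layer attention scores by a weight inversely dependent on dependency-tree distance — can be sketched as follows. This is a minimal illustration under stated assumptions: the `1/(1 + alpha * d)` weight function and the `alpha` hyperparameter are hypothetical choices, not the cited paper's exact formulation.

```python
import numpy as np

def reweight_attention(scores, tree_dist, alpha=1.0):
    """Downweight attention between tokens far apart on the dependency tree.

    scores:    (n, n) raw attention scores from the final transformer layer.
    tree_dist: (n, n) pairwise token distances on the dependency tree.
    alpha:     hypothetical scaling hyperparameter (illustrative only).
    """
    # Weight is inversely dependent on tree distance; +1 avoids division
    # by zero on the diagonal (a token's distance to itself is 0).
    weights = 1.0 / (1.0 + alpha * tree_dist)
    adjusted = scores * weights
    # Renormalize each query row into a probability distribution
    # with a numerically stable softmax.
    exp = np.exp(adjusted - adjusted.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

With uniform raw scores, the adjusted distribution concentrates on syntactically nearby tokens, which is the intended inductive bias.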
“…Compositional generalization has also been explored recently in multimodal settings for tasks such as robot navigation [22,42], VQA [43], and so on. Models using dependency parse information [27,28], graph-based reasoning [44,45,25], and multi-task learning [46] have improved the compositionality of neural network models. In this paper, we explore the compositional generalization of SLU models based on the transformer architecture, trained jointly for intent classification and slot tagging tasks.…”
Section: Related Work
confidence: 99%
“…We follow the commonly adopted definition of REs put forward by computational linguistics and natural language processing (e.g., [36]), and consider a (noun) phrase an RE if it is an accurate description of the referent, but not of any other object in the current scene. Likewise, in the vision & language research field, visual RE resolution and generation have seen a rise of interest, especially in still images [8,28,30,31,50], and more recently also in videos [1,6]. The task is formulated as: given an instance comprising an image or video with one or multiple objects, and an RE, identify the referent that the RE describes by predicting, e.g., its bounding box or segmentation mask.…”
Section: Referring Expression Categorization
confidence: 99%
“…In view of this, Yu et al. [143] proposed a more general module-based method named Modular Attention Network (MAttNet) for adaptively modeling the input expression with language-based attention and visual attention. Based on MAttNet, Liu et al. [144] designed an erasing approach named Cross-Modal Attention-Guided Erasing (CM-Att-Erase) to learn better textual-visual correspondences.…”
Section: Localization-Based Models
confidence: 99%