2018
DOI: 10.1609/aaai.v32i1.12343
Using Syntax to Ground Referring Expressions in Natural Images

Abstract: We introduce GroundNet, a neural network for referring expression recognition---the task of localizing (or grounding) in an image the object referred to by a natural language expression. Our approach to this task is the first to rely on a syntactic analysis of the input referring expression in order to inform the structure of the computation graph. Given a parse tree for an input expression, we explicitly map the syntactic constituents and relationships present in the tree to a composed graph of neural modules…
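The compositional idea in the abstract — each constituent of the parse tree becomes a neural module, and modules are composed following the tree — can be illustrated with a minimal sketch. Everything here is hypothetical: the toy `score_word` leaf module, the set-based region representation, and the multiplicative combination are illustrative stand-ins, not GroundNet's actual learned modules.

```python
def score_word(word, regions):
    # Hypothetical leaf module: score each image region by whether it
    # carries the word as an attribute (a stand-in for a learned scorer).
    return [1.0 if word in feats else 0.1 for feats in regions]

def ground(node, regions):
    """Recursively ground a parse-tree node: each syntactic constituent
    becomes a module, and a node's region scores are combined with the
    scores produced by its children, mirroring the tree structure."""
    score = score_word(node["word"], regions)
    for child in node.get("children", []):
        child_score = ground(child, regions)
        # Combine evidence multiplicatively (illustrative choice only).
        score = [s * c for s, c in zip(score, child_score)]
    return score

# Toy scene: three regions described by attribute sets.
regions = [{"dog", "table"}, {"dog", "floor"}, {"cat", "table"}]
# Toy parse for "dog on table": head noun with one dependent.
tree = {"word": "dog", "children": [{"word": "table"}]}
scores = ground(tree, regions)  # region 0 satisfies both constituents
```

The point of the sketch is the control flow: the computation graph is not fixed in advance but assembled per input from the expression's syntax.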

Cited by 28 publications (8 citation statements)
References 30 publications
“…Our evaluation metrics are slot tagging F1 score and intent accuracy. Incorporating dependency parse information is known to improve the compositional generalization of neural networks [27,28]. We test an advanced baseline model (BERT SLU + parse tree) which modifies the original attention scores in the final transformer layer with a weight inversely dependent on token distance on the dependency tree.…”
Section: Settings and Baselines
confidence: 99%
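The reweighting scheme quoted above — scaling final-layer attention scores by a weight inversely dependent on dependency-tree distance — can be sketched as follows. This is a minimal illustration under stated assumptions: the `1/(1 + alpha * d)` weight function and the `alpha` hyperparameter are hypothetical choices, not the cited paper's exact formulation.

```python
import numpy as np

def reweight_attention(scores, tree_dist, alpha=1.0):
    """Downweight attention between tokens far apart on the dependency tree.

    scores:    (n, n) raw attention scores from the final transformer layer.
    tree_dist: (n, n) pairwise token distances on the dependency tree.
    alpha:     hypothetical scaling hyperparameter (illustrative only).
    """
    # Weight is inversely dependent on tree distance; +1 avoids division
    # by zero on the diagonal (a token's distance to itself is 0).
    weights = 1.0 / (1.0 + alpha * tree_dist)
    adjusted = scores * weights
    # Renormalize each query row into a probability distribution
    # with a numerically stable softmax.
    exp = np.exp(adjusted - adjusted.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)
```

With uniform raw scores, the adjusted distribution concentrates on syntactically nearby tokens, which is the intended inductive bias.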
“…Compositional generalization has also been explored recently in multimodal settings for tasks such as robot navigation [22,42], VQA [43], and so on. Models using dependency parse information [27,28], graph-based reasoning [44,45,25], and multi-task learning [46] have improved the compositionality of neural network models. In this paper, we explore the compositional generalization of SLU models based on the transformer architecture, trained jointly for intent classification and slot tagging tasks.…”
Section: Related Work
confidence: 99%
“…We follow the commonly adopted definition of REs put forward by computational linguistics and natural language processing (e.g., [36]), and consider a (noun) phrase an RE if it is an accurate description of the referent, but not of any other object in the current scene. Likewise, in the vision & language research field, visual RE resolution and generation have seen a rise of interest, especially in still images [8,28,30,31,50], and more recently also in videos [1,6]. The task is formulated as: given an instance comprising an image or video with one or multiple objects, and an RE, identify the referent that the RE describes by predicting, e.g., its bounding box or segmentation mask.…”
Section: Referring Expression Categorization
confidence: 99%
“…In view of this, Yu et al. [143] proposed a more general module-based method named Modular Attention Network (MAttNet) for adaptively modeling the input expression with language-based attention and visual attention. Based on MAttNet, Liu et al. [144] designed an erasing approach named Cross-Modal Attention-Guided Erasing (CM-Att-Erase) to learn better textual-visual correspondences.…”
Section: Localization-Based Models
confidence: 99%