Proceedings of the 28th ACM International Conference on Multimedia 2020
DOI: 10.1145/3394171.3413877
Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Abstract: It is widely acknowledged that capturing relationships among multi-modality features helps represent, and ultimately describe, an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I²RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature represen…
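Only the abstract is visible here, but it indicates that the RETB strengthens intra-modality visual relations among object features within a Transformer encoder. Below is a minimal PyTorch sketch of one common way such a block can be built, by adding a learned pairwise-geometry bias to the self-attention logits over region features. The class name, box-offset encoding, and hyper-parameters are illustrative assumptions, not the paper's published design.

```python
import torch
import torch.nn as nn


class RelationEnhancedBlock(nn.Module):
    """Hypothetical sketch of a relation-enhanced Transformer block.

    Assumption: intra-modality relations are injected as an additive,
    per-head attention bias computed from pairwise box geometry. The
    visible abstract does not specify the actual RETB formulation.
    """

    def __init__(self, d_model=512, n_heads=8, d_geo=64):
        super().__init__()
        self.n_heads = n_heads
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Maps raw pairwise box offsets (4-d) to one bias value per head.
        self.geo_proj = nn.Sequential(
            nn.Linear(4, d_geo), nn.ReLU(), nn.Linear(d_geo, n_heads))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(),
            nn.Linear(4 * d_model, d_model))

    def forward(self, x, boxes):
        # x: (B, N, d_model) region features; boxes: (B, N, 4) as (cx, cy, w, h).
        B, N, _ = x.shape
        delta = boxes.unsqueeze(2) - boxes.unsqueeze(1)   # (B, N, N, 4) raw offsets
        bias = self.geo_proj(delta)                       # (B, N, N, n_heads)
        bias = bias.permute(0, 3, 1, 2).reshape(B * self.n_heads, N, N)
        h, _ = self.attn(x, x, x, attn_mask=bias)         # geometry-biased attention
        x = self.norm1(x + h)
        return self.norm2(x + self.ffn(x))
```

Stacking a few such blocks over detector features would give the intra-modality encoder the abstract alludes to; the inter-modality side would then couple these outputs with caption-word features in the decoder.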

Cited by 18 publications (8 citation statements)
References 26 publications
“…Object relationships have been employed in a few computer vision tasks such as scene graph generation [32]. They encode the interplay between object instances to achieve high-level image understanding, facilitating applications such as image captioning [33] and manipulation [34]. However, this kind of high-level relationship is too general and lacks discrimination for the low-level camera relocalization task at hand.…”
Section: B. Object Related Methods
confidence: 99%
“…Wang et al. [36] further refined the modelling of intra- and inter-modality visual relations in image captioning. Cornia et al.…”
Section: Related Work
confidence: 99%
“…Huang et al. [8] and Pan et al. [18] rely on variants of the attention block to find interactions between multi-modal inputs. Cornia et al. [7], Jiayi et al. [9], and Wang et al. [28] use the popular Transformer to encode inter- and intra-modal relations via self-attention. Following the success of graph-based methods in the computer vision community, some researchers adopt scene graphs to encode visual relationships with the help of GCNs, which will be discussed next.…”
Section: Related Work, 2.1 Image Captioning
confidence: 99%
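The scene-graph route mentioned at the end of this statement typically runs a GCN over a detected graph of objects and relations. A generic sketch of that message-passing step follows; the layer structure and names are illustrative assumptions and do not reproduce any specific cited model.

```python
import torch
import torch.nn as nn


class SceneGraphGCN(nn.Module):
    """Generic GCN-style message passing over a visual scene graph.

    `adj` would come from an off-the-shelf scene-graph detector; this
    sketch shows the general recipe, not any particular cited model.
    """

    def __init__(self, d_model=512, n_layers=2):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(d_model, d_model) for _ in range(n_layers))

    def forward(self, x, adj):
        # x: (B, N, d_model) object features; adj: (B, N, N), adj[b, i, j] = 1
        # when the scene graph has an edge from node j to node i.
        deg = adj.sum(-1, keepdim=True).clamp(min=1)  # avoid divide-by-zero
        for lin in self.layers:
            msg = torch.bmm(adj, x) / deg             # mean over neighbours
            x = torch.relu(lin(msg)) + x              # transform + residual
        return x
```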
“…We compare our models with SCST [21], Up-Down [2], GCN-LSTM [31], VSUA [15], SGAE [30], VRD [22], I²RT [28], MMT [7] and GET [9]. Among them, SCST trains the model with reinforcement learning, using the reward of its own greedy inference as the baseline.…”
Section: Quantitative Analysis
confidence: 99%
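Among the compared baselines, SCST [21] is the one whose training signal the statement describes: REINFORCE with the reward of the model's own greedy inference as the baseline. A minimal sketch of that objective, with the tensor shapes and the CIDEr-reward plumbing assumed for illustration:

```python
import torch


def scst_loss(log_probs, sampled_reward, greedy_reward):
    """Self-critical sequence training objective (sketch).

    log_probs:      (B,) summed log-probabilities of the *sampled* captions
    sampled_reward: (B,) e.g. CIDEr of the sampled captions
    greedy_reward:  (B,) CIDEr of the greedily decoded captions (baseline)
    """
    advantage = (sampled_reward - greedy_reward).detach()
    # REINFORCE with a self-critical baseline: raise the probability of
    # sampled captions that score better than the model's greedy output.
    return -(advantage * log_probs).mean()
```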