Proceedings of the 29th ACM International Conference on Multimedia 2021
DOI: 10.1145/3474085.3475439

Dual Graph Convolutional Networks with Transformer and Curriculum Learning for Image Captioning

Abstract: Existing image captioning methods focus only on the relationships between objects or instances within a single image, without exploring the contextual correlation among contextual images. In this paper, we propose Dual Graph Convolutional Networks (Dual-GCN) with transformer and curriculum learning for image captioning. In particular, we not only use an object-level GCN to capture the object-to-object spatial relations within a single image, but also adopt an image-level GCN to capture the featu…
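As a rough illustration of what a single graph convolution over detected objects can look like, here is a minimal sketch in plain Python. It assumes the widely used normalized-adjacency formulation, ReLU(D^{-1/2}(A + I)D^{-1/2} H W). The abstract does not say which GCN variant Dual-GCN uses, so the function names, the toy graph, and the identity weight matrix are all hypothetical.

```python
import math

def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square adjacency matrix."""
    n = len(adj)
    # Add self-loops so every node also keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # always >= 1 because of the self-loops
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(a, b):
    """Plain-Python matrix product of two nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def gcn_layer(adj, features, weight):
    """One graph convolution: ReLU(A_norm @ features @ weight)."""
    a_norm = normalize_adjacency(adj)
    hidden = matmul(matmul(a_norm, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]

# Hypothetical toy graph: 3 detected objects with edges 0-1 and 1-2.
adj = [[0, 1, 0],
       [1, 0, 1],
       [0, 1, 0]]
features = [[1.0, 0.0],
            [0.0, 1.0],
            [1.0, 1.0]]
weight = [[1.0, 0.0],
          [0.0, 1.0]]  # identity weights, purely for illustration

out = gcn_layer(adj, features, weight)
```

Each output row mixes an object's own feature vector with those of its neighbours. An image-level graph could reuse the same layer with whole images as nodes, though that is likewise an assumption and not something the abstract confirms.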

Citations: cited by 62 publications (18 citation statements)
References: 51 publications
“…Dong et al. [63] proposed dual graph convolutional networks (Dual‐GCN) with transformer and curriculum learning to explore the contextual relevance between contextual images for image captioning, see Figure 9. Two independent GCNs encode the entire image and the objects from the image, and then the captions are generated by a Transformer linguistic decoder.…”
Section: The Recent Deep Learning Methods (mentioning)
confidence: 99%
“…Transformer [11,44] has also been adapted to tackle the problem of human motion prediction [1,4]. Similar to GCN, the self-attention mechanism of Transformer can compute pairwise relations of joints.…”
Section: Related Work (mentioning)
confidence: 99%
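The pairwise relations mentioned in the statement above can be sketched as scaled dot-product attention weights. This is a minimal plain-Python illustration and not the cited authors' implementation. It assumes the joint features are used directly as both queries and keys, with no learned projections, and the toy joint vectors are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Row-wise softmax(Q K^T / sqrt(d)): one weight per (query, key) pair."""
    d = len(keys[0])
    scores = [[sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
               for k in keys]
              for q in queries]
    return [softmax(row) for row in scores]

# Hypothetical 2-D features for three joints.
joints = [[1.0, 0.0],
          [0.0, 1.0],
          [1.0, 1.0]]

# weights[i][j] says how strongly joint i attends to joint j.
weights = attention_weights(joints, joints)
```

Unlike a GCN, where the adjacency matrix fixes which pairs interact, these weights are computed from the features themselves, so every pair of joints receives a relation score.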
“…Graph Convolution Network (GCN). Due to the higher representation power of graph structure, GCN has demonstrated superior performance in several tasks, including image caption [8], text to image and human pose estimation [4]. In 3D computer vision, Wald et al [40] proposed the first learning method that generated a semantic scene graph from a 3D point cloud.…”
Section: Related Work (mentioning)
confidence: 99%