Recently, scene graph generation methods have been used in image captioning to encode objects and their relationships in the encoder-decoder framework, where the decoder selects a subset of the graph nodes as input for word inference. However, current methods attend to the scene graph relying on ambiguous language information, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture that models dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and the previous attention results over scene graph nodes to construct a partial scene graph. It then chooses objects or relationships that have close connections with this partial graph to infer the next word. Our SGE is appealing in that it can be plugged into any scene-graph-based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, achieving state-of-the-art performance on most metrics.
CCS CONCEPTS
• Computing methodologies → Image representations; Natural language generation.
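To make the decoding step described in the abstract concrete, the following is a minimal sketch of the graph-extension idea: attention at each step is restricted to nodes connected to the partial (already attended) scene graph, and the attended node is added to the partial graph. All names here (`extend_and_attend`, the dot-product scoring, the toy features) are hypothetical placeholders, not the paper's actual learned model, which trains the scoring jointly with the captioning decoder.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax."""
    e = np.exp(x - x.max())
    return e / e.sum()

def extend_and_attend(node_feats, adjacency, attended, query):
    """One decoding step of the sketched graph-extension scheme.

    node_feats : (N, D) array of scene-graph node embeddings
    adjacency  : (N, N) boolean adjacency over objects/relationships
    attended   : set of node indices already in the partial graph
    query      : (D,) language-state vector (stand-in for the
                 decoder hidden state)
    """
    n = node_feats.shape[0]
    if attended:
        # Candidates: unattended neighbors of the partial graph;
        # fall back to all remaining nodes if none are connected.
        cand = {j for i in attended for j in range(n)
                if adjacency[i, j] and j not in attended}
        cand = sorted(cand) or sorted(set(range(n)) - attended)
    else:
        cand = list(range(n))  # first step: all nodes are candidates
    scores = node_feats[cand] @ query       # dot-product relevance
    weights = softmax(scores)               # attention distribution
    context = weights @ node_feats[cand]    # pooled graph context
    chosen = cand[int(weights.argmax())]    # node joining the partial graph
    return context, chosen

# Toy example: 4 nodes (e.g. 2 objects, 2 relationships), chain-connected.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
adj = np.zeros((4, 4), dtype=bool)
for i, j in [(0, 1), (1, 2), (2, 3)]:
    adj[i, j] = adj[j, i] = True

attended = set()
query = rng.normal(size=8)
for step in range(3):
    context, chosen = extend_and_attend(feats, adj, attended, query)
    attended.add(chosen)
    print(f"step {step}: attended node {chosen}")
```

The key design choice illustrated is that candidate selection depends on the graph structure (the partial graph's neighborhood), not solely on the language query, which is the connection-aware behavior the abstract contrasts with purely language-driven attention.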