When describing an image, people can rapidly extract its topic and find the main object, generating sentences that match the main idea of the image. However, most scene graph generation methods do not emphasise the importance of the image's topic. Consequently, the captions generated by scene graph-based image captioning models cannot reflect the topic of the image and therefore fail to express its central idea. In this paper, we propose a method for image captioning based on topic scene graphs (TSG). First, we propose the structure of the topic scene graph, which expresses an image's topic as well as the relationships between objects. Then, we utilise salient object detection to generate topic scene graphs that highlight the salient objects of the image. Note that our framework is agnostic to the underlying scene graph-based image captioning model and can therefore be applied broadly wherever salient object predictions are of interest. We compare our topic scene graph against state-of-the-art scene graph generation models and mainstream image captioning models on the MSCOCO and Visual Genome datasets, achieving better performance on both.
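To illustrate one way salient object detection could be combined with a scene graph, the sketch below ranks scene-graph triples by the saliency of their participants and keeps the most salient relations as the "topic" portion of the graph. The function name, data layout, and scoring rule are assumptions made for illustration, not the paper's implementation.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Triple:
    subject: str    # e.g. "dog"
    predicate: str  # e.g. "chasing"
    obj: str        # e.g. "frisbee"

def build_topic_scene_graph(triples: List[Triple],
                            saliency: Dict[str, float],
                            top_k: int = 5) -> List[Triple]:
    # `saliency` maps an object label to a score in [0, 1], e.g. pooled per
    # detected region from an off-the-shelf salient object detector.
    def score(t: Triple) -> float:
        return max(saliency.get(t.subject, 0.0), saliency.get(t.obj, 0.0))
    # Keep only the relations whose participants are most salient.
    return sorted(triples, key=score, reverse=True)[:top_k]

if __name__ == "__main__":
    triples = [
        Triple("dog", "chasing", "frisbee"),
        Triple("tree", "behind", "fence"),
        Triple("man", "throwing", "frisbee"),
    ]
    saliency = {"dog": 0.9, "frisbee": 0.8, "man": 0.6, "tree": 0.1, "fence": 0.1}
    for t in build_topic_scene_graph(triples, saliency, top_k=2):
        print(t.subject, t.predicate, t.obj)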
People can accurately describe an image by constantly referring to its visual information and key textual information. Inspired by this idea, we propose VTR-PTM (Visual-Text Reference Pretraining Model) for image captioning. First, building on a pretraining model (BERT/UniLM), we design a dual-stream input mode of image reference and text reference and use two different mask modes (bidirectional and sequence-to-sequence) to adapt VTR-PTM to generation tasks. Second, the target dataset is used to fine-tune VTR-PTM. To the best of our knowledge, VTR-PTM is the first reported pretraining model to use visual-text references in the learning process. To evaluate the model, we conduct experiments on the image captioning benchmark datasets MS COCO and Visual Genome and achieve significant improvements on most metrics. The code is available at https://github.com/lpfworld/VTR-PTM.
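The two mask modes mentioned in the abstract can be illustrated with UniLM-style self-attention masks: a bidirectional mask where every token sees every other token, and a sequence-to-sequence mask where the source (reference) segment is fully visible while the target (caption) segment attends only to the source and its own left context. The sketch below is a minimal illustration under assumed tensor shapes, not the released VTR-PTM code.

import torch

def bidirectional_mask(seq_len: int) -> torch.Tensor:
    # Every position may attend to every position (1 = attention allowed),
    # as in BERT-style masked-language-model pretraining.
    return torch.ones(seq_len, seq_len)

def seq2seq_mask(src_len: int, tgt_len: int) -> torch.Tensor:
    # Source (reference) tokens attend bidirectionally among themselves;
    # target (caption) tokens attend to the full source and to their own
    # left context only, which enables autoregressive generation.
    total = src_len + tgt_len
    mask = torch.zeros(total, total)
    mask[:src_len, :src_len] = 1.0
    mask[src_len:, :src_len] = 1.0
    mask[src_len:, src_len:] = torch.tril(torch.ones(tgt_len, tgt_len))
    return mask

# Example: a 3-token reference segment followed by a 2-token caption prefix.
print(seq2seq_mask(3, 2))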