2018
DOI: 10.1007/978-3-030-01216-8_31

Recurrent Fusion Network for Image Captioning

Abstract: Recently, much progress has been made in image captioning, and an encoder-decoder framework has been adopted by all the state-of-the-art models. Under this framework, an input image is encoded by a convolutional neural network (CNN) and then translated into natural language with a recurrent neural network (RNN). The existing models relying on this framework merely employ one kind of CNN, e.g., ResNet or Inception-X, which describes image contents from only one specific viewpoint. Thus, the semantic meaning of …
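As a rough illustration of the encoder-decoder framework the abstract describes, the sketch below encodes an image with a single pretrained CNN and decodes it into words with an LSTM. The backbone choice, dimensions, and the show-and-tell-style conditioning are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # CNN encoder: a pretrained ResNet with its classification head removed.
        resnet = models.resnet101(weights=models.ResNet101_Weights.DEFAULT)
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])
        self.img_proj = nn.Linear(resnet.fc.in_features, embed_dim)
        # RNN decoder: an LSTM that "translates" the image feature into words.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (B, 3, 224, 224); captions: (B, T) word indices
        feats = self.encoder(images).flatten(1)        # (B, 2048) global image feature
        v = self.img_proj(feats).unsqueeze(1)          # (B, 1, E)
        w = self.embed(captions)                       # (B, T, E)
        # Feed the image feature as the first "word", followed by the caption prefix.
        h, _ = self.lstm(torch.cat([v, w], dim=1))     # (B, T+1, H)
        return self.out(h)                             # per-step word logits
```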

Cited by 258 publications (131 citation statements)
References 53 publications
“…We report the performance on the offline test split of our model as well as the compared models in Table 1. The models include: LSTM [37], which encodes the image using CNN and decodes it using LSTM; SCST [31], which employs a modified visual attention and is the first to use SCST to directly optimize the evaluation metrics; Up-Down [2], which employs a two-LSTM layer model with bottom-up features extracted from Faster-RCNN; RFNet [20], which fuses encoded features from multiple CNN networks; GCN-LSTM [49], which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and SGAE [44], which introduces auto-encoding scene graphs into its model.…”
Section: Quantitative Analysis (mentioning)
confidence: 99%
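The excerpt above credits RFNet with fusing encoded features from multiple CNNs. The sketch below illustrates that general idea with a simple concatenate-and-project encoder over two backbones; the backbone choices and the fusion rule are stand-in assumptions, whereas the actual RFNet applies a recurrent fusion procedure over features from several CNNs.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class MultiCNNFusionEncoder(nn.Module):
    def __init__(self, out_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        densenet = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        # Strip the classification heads, keep the convolutional trunks.
        self.backbones = nn.ModuleList([
            nn.Sequential(*list(resnet.children())[:-1]),               # -> (B, 2048, 1, 1)
            nn.Sequential(densenet.features, nn.AdaptiveAvgPool2d(1)),  # -> (B, 1024, 1, 1)
        ])
        self.fuse = nn.Linear(2048 + 1024, out_dim)

    def forward(self, images):
        # Encode the same image with every backbone, then fuse into one vector.
        feats = [b(images).flatten(1) for b in self.backbones]
        return self.fuse(torch.cat(feats, dim=1))       # fused representation for the decoder
```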
“…In this paper, we propose the Reflective Decoding Network (RDN) for image captioning, which mitigates the drawback of the traditional caption decoder by enhancing its long sequential modeling ability. Different from previous methods, which boost captioning performance by improving the visual attention mechanism [2,26,45] or by improving the encoder to supply a more meaningful intermediate representation for the decoder [17,47,48,50], our RDN focuses directly on the target decoding side and jointly applies the attention mechanism in both the visual and textual domains.…”
Section: Basis Decoder (mentioning)
confidence: 99%
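The excerpt above describes attending over both the visual and the textual domain. The following minimal sketch shows one way to realize that: the current decoder state queries a set of image-region features and the hidden states of the words generated so far, and the two contexts are merged. All names and dimensions are illustrative assumptions, not RDN's actual design.

```python
import torch
import torch.nn as nn

def attend(query, keys):
    # query: (B, D), keys: (B, N, D) -> attention-weighted sum of keys, (B, D)
    scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)   # (B, N)
    weights = torch.softmax(scores, dim=1)
    return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)

class DualDomainAttention(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.merge = nn.Linear(2 * dim, dim)

    def forward(self, h_t, region_feats, past_states):
        # h_t:          current decoder state (B, D)
        # region_feats: visual features of image regions (B, R, D)
        # past_states:  decoder states of previously generated words (B, T, D)
        visual_ctx = attend(h_t, region_feats)    # visual-domain attention
        textual_ctx = attend(h_t, past_states)    # textual-domain attention
        return self.merge(torch.cat([visual_ctx, textual_ctx], dim=1))
```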
“…We compare our proposed RDN with other state-of-the-art image captioning methods considering different aspects in both the offline and online settings. Latest and representative works include: (1) Adaptive [26], which proposes adaptive attention by designing a visual sentinel gate that lets the captioning model decide whether to attend to the image feature or just rely on the sequential language model; (2) LSTM-A3 [49], which incorporates high-level semantic attribute information into the encoder-decoder model; (3) Up-Down [2], which introduces the bottom-up and top-down attention mechanism to enable attention calculated at the level of objects or salient subregions; and (4) RFNet [17], which uses multiple kinds of CNNs to extract complementary image features and generate a more informative representation for the decoder.…”
Section: Performance Comparison and Analysis (mentioning)
confidence: 99%
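Item (1) above refers to a visual sentinel gate that decides whether to attend to the image or rely on the language model. A rough sketch of that gating idea follows; it is an illustrative simplification, not the exact formulation of Adaptive [26].

```python
import torch
import torch.nn as nn

class SentinelGate(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.gate = nn.Linear(2 * dim, 1)

    def forward(self, visual_ctx, sentinel, h_t):
        # visual_ctx: attended image feature (B, D)
        # sentinel:   "visual sentinel" vector derived from the LSTM memory (B, D)
        # h_t:        current decoder hidden state (B, D)
        beta = torch.sigmoid(self.gate(torch.cat([sentinel, h_t], dim=1)))  # (B, 1)
        # beta near 1: rely on the language model; beta near 0: look at the image.
        return beta * sentinel + (1 - beta) * visual_ctx
```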
“…captioning models: ATT [54], SAT [52], RFNet [20], and Up-Down (UD) [3]. The results are shown in Table 2.…”
Section: Cross-modal Generation (mentioning)
confidence: 99%
“…Our model, despite using a much shallower CNN, outperforms ATT and SAT by a large margin. The other two baselines use even more sophisticated image encoders: RFNet [20] combines ResNet-101 [16], DenseNet [18], and Inception-V3/V4/ResNet-V2 [40], all pretrained on ImageNet [10]. Up-Down (UD) [3] uses a Faster R-CNN [37] with ResNet-101 [16] pretrained on ImageNet [10] and fine-tuned on Visual Genome [24] and COCO [6].…”
Section: Cross-modal Generation (mentioning)
confidence: 99%
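The excerpt above mentions Up-Down's Faster R-CNN encoder, which supplies "bottom-up" region features to the decoder. The sketch below is a hypothetical stand-in using a stock torchvision detector: it detects object boxes and pools backbone features inside them. Up-Down itself uses a ResNet-101 Faster R-CNN trained on Visual Genome, so every choice here (detector, FPN level, pooling) is an assumption for illustration only.

```python
import torch
import torchvision
from torchvision.ops import roi_align

detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
    weights=torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
).eval()

@torch.no_grad()
def bottom_up_features(image):                        # image: (3, H, W), values in [0, 1]
    boxes = detector([image])[0]["boxes"]             # detected object boxes (N, 4)
    # Simplified: call the backbone directly, skipping the detector's internal
    # resizing/normalization, and take the highest-resolution FPN level.
    fmap = detector.backbone(image.unsqueeze(0))["0"]
    scale = fmap.shape[-1] / image.shape[-1]          # map image coords to feature coords
    pooled = roi_align(fmap, [boxes], output_size=7, spatial_scale=scale)
    return pooled.mean(dim=(2, 3))                    # one feature vector per region (N, 256)
```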