Recurrent Fusion Network for Image Captioning

Jiang, Wenhao; Ma, Lin; Jiang, Yu‐Gang; Liu, Wei; Zhang, Tong

doi:10.1007/978-3-030-01216-8_31

Cited by 258 publications

(131 citation statements)

References 53 publications

“…We report the performance on the offline test split of our model as well as the compared models in Table 1. The models include: LSTM [37], which encodes the image using CNN and decodes it using LSTM; SCST [31], which employs a modified visual attention and is the first to use SCST to directly optimize the evaluation metrics; Up-Down [2], which employs a two-LSTM layer model with bottom-up features extracted from Faster-RCNN; RFNet [20], which fuses encoded features from multiple CNN networks; GCN-LSTM [49], which predicts visual relationships between every two entities in the image and encodes the relationship information into feature vectors; and SGAE [44], which introduces auto-encoding scene graphs into its model.…”

Section: Quantitative Analysis (mentioning)

confidence: 99%

Attention on Attention for Image Captioning

Huang, Wang, Chen, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Attention mechanisms are widely used in current encoder/decoder frameworks of image captioning, where a weighted average on encoded vectors is generated at each time step to guide the caption decoding process. However, the decoder has little idea of whether or how well the attended vector and the given attention query are related, which could make the decoder give misled results. In this paper, we propose an "Attention on Attention" (AoA) module, which extends the conventional attention mechanisms to determine the relevance between attention results and queries. AoA first generates an "information vector" and an "attention gate" using the attention result and the current context, then adds another attention by applying element-wise multiplication to them and finally obtains the "attended information", the expected useful knowledge. We apply AoA to both the encoder and the decoder of our image captioning model, which we name as AoA Network (AoANet). Experiments show that AoANet outperforms all previously published methods and achieves a new state-of-the-art performance of 129.8 CIDEr-D score on MS COCO "Karpathy" offline test split and 129.6 CIDEr-D (C40) score on the official online testing server. Code is available at https://github.com/husthuaan/AoANet.
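
The gating step described in this abstract can be illustrated with a minimal plain-Python sketch. The abstract states only that an "information vector" and an "attention gate" are computed from the attention result and the current context and then multiplied element-wise. The linear projections over the concatenated inputs, the sigmoid on the gate, and every name below (`attention_on_attention`, `W_info`, `W_gate`, `random_params`) are illustrative assumptions, not the authors' released implementation.

```python
import math
import random


def linear(weights, vec, bias):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_on_attention(query, attended, params):
    """Gate an attention result by its relevance to the query.

    query    -- current context vector (assumed list of floats)
    attended -- result of a conventional attention step (list of floats)
    params   -- hypothetical weights: W_info, b_info, W_gate, b_gate
    """
    joint = query + attended  # list concatenation of the two inputs
    # "Information vector": a linear projection of query and attention result.
    info = linear(params["W_info"], joint, params["b_info"])
    # "Attention gate": same inputs, squashed into (0, 1) by a sigmoid.
    gate = [sigmoid(g)
            for g in linear(params["W_gate"], joint, params["b_gate"])]
    # "Attended information": element-wise product of gate and information.
    return [g * i for g, i in zip(gate, info)]


def random_params(in_dim, out_dim, seed=0):
    """Small random weights, for illustration only (not trained values)."""
    rng = random.Random(seed)

    def matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
                for _ in range(out_dim)]

    return {
        "W_info": matrix(), "b_info": [0.0] * out_dim,
        "W_gate": matrix(), "b_gate": [0.0] * out_dim,
    }


if __name__ == "__main__":
    q = [1.0, 2.0]    # toy query/context vector
    v = [0.5, -0.5]   # toy attention result
    print(attention_on_attention(q, v, random_params(len(q) + len(v), 3)))
```

Because the gate lies strictly between 0 and 1, each output component is a scaled-down copy of the corresponding information component; a gate near 0 suppresses an attention result that is judged irrelevant to the query.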

“…In this paper, we propose the Reflective Decoding Network (RDN) for image captioning, which mitigates the drawback of traditional caption decoder by enhancing its long sequential modeling ability. Different from previous methods which boost captioning performance by improving the visual attention mechanism [2,26,45], or by improving the encoder to supply more meaningful intermediate representation for the decoder [17,47,48,50], our RDN focuses directly on the target decoding side and jointly apply attention mechanism in both visual and textual domain.…”

Section: Basis Decoder (mentioning)

confidence: 99%

“…We compare our proposed RDN with other state-of-the-art image captioning methods considering different aspects both in offline and online situation. Latest and representative works include: (1) Adaptive [26] which proposes the adaptive attention through designing a visual sentinel gate for captioning model to decide whether to attend to the image feature or just rely on the sequential language model, (2) LSTM-A3 [49] which incorporates the high level semantic attribute information to the encoder-decoder model, (3) Up-Down [2] which introduces the bottom-up and top-down attention mechanism to enable attention calculated at the level of objects or salient subregions and (4) RFNet [17] which uses multiple kinds of CNNs to extract complementary image feature and generate a more informative representation for the decoder.…”

Section: Performance Comparison and Analysis (mentioning)

confidence: 99%

Reflective Decoding Network for Image Captioning

Ke, Pei, Li, et al. 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

State-of-the-art image captioning methods mostly focus on improving visual features, less attention has been paid to utilizing the inherent properties of language to boost captioning performance. In this paper, we show that vocabulary coherence between words and syntactic paradigm of sentences are also important to generate high-quality image caption. Following the conventional encoder-decoder framework, we propose the Reflective Decoding Network (RDN) for image captioning, which enhances both the long-sequence dependency and position perception of words in a caption decoder. Our model learns to collaboratively attend on both visual and textual features and meanwhile perceive each word's relative position in the sentence to maximize the information delivered in the generated caption. We evaluate the effectiveness of our RDN on the COCO image captioning datasets and achieve superior performance over the previous methods. Further experiments reveal that our approach is particularly advantageous for hard cases with complex scenes to describe by captions.

“…tioning models: ATT [54], SAT [52], RFNet [20], and Up-Down (UD) [3]. The results are shown in Table 2.…”

Section: Cross-modal Generation (mentioning)

confidence: 99%

“…Our model, despite using a much shallower CNN, outperforms ATT and SAT by a large margin. The other two baselines use even more sophisticated image encoders: RFNet [20] combines ResNet-101 [16], DenseNet [18], Inception-V3/V4/Resnet-V2 [40], all pretrained on ImageNet [10]. UpDown (UD) [3] uses a Faster R-CNN [37] with Resnet-101 [16] pretrained on ImageNet [10] and finetuned on Visual Genome [24] and COCO [6].…”

Section: Cross-modal Generation (mentioning)

confidence: 99%

Unpaired Image-to-Speech Synthesis With Multimodal Information Bottleneck

McDuff, Song 2019

2019 IEEE/CVF International Conference on Computer Vision (ICCV)

Deep generative models have led to significant advances in cross-modal generation such as text-to-image synthesis. Training these models typically requires paired data with direct correspondence between modalities. We introduce the novel problem of translating instances from one modality to another without paired data by leveraging an intermediate modality shared by the two other modalities. To demonstrate this, we take the problem of translating images to speech. In this case, one could leverage disjoint datasets with one shared modality, e.g., image-text pairs and text-speech pairs, with text as the shared modality. We call this problem "skip-modal generation" because the shared modality is skipped during the generation process. We propose a multimodal information bottleneck approach that learns the correspondence between modalities from unpaired data (image and speech) by leveraging the shared modality (text). We address fundamental challenges of skip-modal generation: 1) learning multimodal representations using a single model, 2) bridging the domain gap between two unrelated datasets, and 3) learning the correspondence between modalities from unpaired data. We show qualitative results on image-to-speech synthesis; this is the first time such results have been reported in the literature. We also show that our approach improves performance on traditional cross-modal generation, suggesting that it improves data efficiency in solving individual tasks.
