2018
DOI: 10.1017/s1351324918000098

Where to put the image in an image caption generator

Abstract: When a recurrent neural network (RNN) language model is used for caption generation, the image information can be fed to the neural network either by directly incorporating it in the RNN, conditioning the language model by 'injecting' image features, or in a layer following the RNN, conditioning the language model by 'merging' image features. While both options are attested in the literature, there is as yet no systematic comparison between the two. In this paper we empirically show that it is not especially detriment…
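The inject/merge distinction from the abstract can be sketched in terms of tensor flow alone. The following is a minimal NumPy sketch, not the paper's actual models: the dimensions, the random weights, and the `fake_rnn` stand-in (which just collapses a sequence to one vector) are illustrative assumptions chosen to make the shapes concrete.

```python
import numpy as np

# Toy dimensions (assumptions for illustration, not from the paper).
embed_dim, img_dim, seq_len = 256, 4096, 5
rng = np.random.default_rng(0)

img = rng.normal(size=(img_dim,))              # CNN image feature vector
words = rng.normal(size=(seq_len, embed_dim))  # embedded caption prefix
W_img = rng.normal(size=(img_dim, embed_dim)) * 0.01  # project image to embed size

def fake_rnn(inputs):
    # Stand-in for an RNN encoder: collapses a sequence to one hidden state.
    return np.tanh(inputs.sum(axis=0))

# Inject: the projected image is fed INTO the RNN, e.g. as a first "word",
# so image and language information mix inside the recurrent layer.
inject_inputs = np.vstack([img @ W_img, words])  # (seq_len + 1, embed_dim)
inject_state = fake_rnn(inject_inputs)

# Merge: the RNN sees only the words; the image joins AFTER the RNN,
# in a layer that combines the two modalities.
merge_state = fake_rnn(words)
merged = np.concatenate([merge_state, img @ W_img])

# A softmax layer over `inject_state` or `merged` would predict the next word.
```

Either conditioning point yields a vector that a final softmax layer can map to next-word probabilities; the paper's question is which point works better.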

Cited by 96 publications (48 citation statements)
References 33 publications
“…For our experiments, we used a variety of pre-trained neural caption generators (36 in all) from [23]. These models are based on four different caption generator architectures.…”
Section: Methods
confidence: 99%
“…Our method is to add step-by-step modules and configurations to the network, providing different kinds of top-down knowledge in Section 2, and to investigate the performance of such configurations. There are several design choices with small effects on performance but costly in terms of parameter size (Tanti et al., 2018b). Therefore, if there is no research question related to a given choice, we take the simplest option reported in previous work, such as (Lu et al., 2017; Anderson et al., 2018).…”
Section: Neural Network Design
confidence: 99%
“…The multimodal NMT toolkit is employed to build the multimodal NMT system for the multimodal translation task, which is based on the PyTorch port of OpenNMT (Klein et al., 2017). For the text-only translation task, OpenNMT is deployed to build the NMT system, and in the case of the Hindi-only image captioning track, the publicly available VGG16 and LSTM in the Keras library are used to build the system (Simonyan and Zisserman, 2015; Tanti et al., 2018). We have used the Hindi Visual Genome dataset in each track of the WAT2019 multimodal translation task provided by the organizer (Nakazawa et al., 2019).…”
Section: System Description
confidence: 99%
“…Hence, we have chosen the predicted translation at an optimum point of 24,000 epochs. In the training process of the Hindi-only image captioning track, we have used the merge model following the settings of (Tanti et al., 2018). The preprocessed image feature vector of 4096 elements is processed by a dense layer to provide 256 elements for the representation of the image.…”
Section: System Training
confidence: 99%
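The dense projection described in the quote above (4096 VGG16 features reduced to a 256-element image representation, then merged with the language model's output) can be sketched as follows. This is a NumPy illustration of the shapes only, assuming random stand-in weights and a ReLU dense layer; it is not the cited system's trained implementation, and the LSTM state here is a placeholder vector.

```python
import numpy as np

rng = np.random.default_rng(1)

# 4096-d VGG16 image feature vector (random stand-in for a real extraction).
vgg_features = rng.normal(size=(4096,))

# Dense layer projecting 4096 -> 256, as in the quoted setup;
# weights and ReLU activation are illustrative assumptions.
W = rng.normal(size=(4096, 256)) * 0.01
b = np.zeros(256)
img_repr = np.maximum(0.0, vgg_features @ W + b)

# Placeholder for the 256-d LSTM summary of the caption prefix.
lstm_state = rng.normal(size=(256,))

# Merge step: image and language representations are combined outside the
# RNN (concatenation shown here); the result would feed a softmax over the
# vocabulary to predict the next caption word.
merged = np.concatenate([img_repr, lstm_state])
```

In a merge architecture this combination happens once per decoding step, with the same `img_repr` reused throughout the caption.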