Where to put the image in an image caption generator

Tanti, Marc; Gatt, Albert; Camilleri, Kenneth P.

doi:10.1017/s1351324918000098

Cited by 96 publications

(48 citation statements)

References 33 publications

“…For our experiments, we used a variety of pre-trained neural caption generators (36 in all) from [23]. These models are based on four different caption generator architectures.…”

Section: Methods (mentioning)

confidence: 99%

Pre-gen Metrics: Predicting Caption Quality Metrics Without Generating Captions

Tanti

Gatt

Muscat

2019

Lecture Notes in Computer Science

Self Cite

“…Our method is to add step-by-step modules and configurations to the network providing different kind of top-down knowledge in Section 2 and investigating the performance of such configurations. There are several design choices with small effects on the performance but costly in terms of parameter size (Tanti et al, 2018b). Therefore, if there is no research question related to that choice, we take the simplest choice as reported in the previous work such as (Lu et al, 2017; Anderson et al, 2018).…”

Section: Neural Network Design (mentioning)

confidence: 99%

What goes into a word: generating image descriptions with top-down spatial knowledge

Ghanimifard

Dobnik

2019

Proceedings of the 12th International Conference on Natural Language Generation

Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with attention mechanism over convolutional visual features. Attention weights align the stratified visual features arranged by their location with tokens, most commonly words, in the target description. However, words such as spatial relations (e.g. next to and under) are not directly referring to geometric arrangements of pixels but to complex geometric and conceptual representations. The aim of this paper is to evaluate what representations facilitate generating image descriptions with spatial relations and lead to better grounded language generation. In particular, we investigate the contribution of four different representational modalities in generating relational referring expressions: (i) (pre-trained) convolutional visual features, (ii) spatial attention over visual features, (iii) top-down geometric relational knowledge between objects, and (iv) world knowledge captured by contextual embeddings in language models.

“…The multimodal NMT toolkit is employed to build the multimodal NMT system for multimodal translation task, which are based on the pytorch port of OpenNMT (Klein et al, 2017). For text-only translation task, OpenNMT is deployed to build the NMT system and in the case of Hindi-only image captioning track, publicly available VGG16 and LSTM in Keras library, are used to build the system (Simonyan and Zisserman, 2015; Tanti et al, 2018). We have used Hindi visual genome dataset in each track of WAT2019 multi-modal translation task provided by the organizer (Nakazawa et al, 2019).…”

Section: System Description (mentioning)

confidence: 99%

“…Hence, we have chosen predicted translation at an optimum point on 24,000 epoch. In the training process of Hindi-only image captioning track, we have used merge-model following settings of (Tanti et al, 2018). The preprocessed image feature vector of 4096 elements are processed by a dense layer to provide 256 elements for representation of the image.…”

Section: System Training (mentioning)

confidence: 99%
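
The citation statement above describes a merge-style setup in which a 4096-element image feature vector is passed through a dense layer to give a 256-element image representation. The following is a minimal, dependency-free Python sketch of that projection step, followed by one possible merge with a caption-prefix vector. Only the 4096 and 256 sizes come from the quoted text. The helper names (`init_dense`, `dense`, `merge`), the random untrained weights, the 256-element placeholder for the LSTM output, and the choice of concatenation as the merge operation are all assumptions made for illustration, not the cited authors' implementation.

```python
import random


def init_dense(n_in, n_out, rng):
    """Create randomly initialised weights and zero biases (hypothetical helper)."""
    scale = 1.0 / n_in ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def dense(x, weights, bias):
    """Affine projection: one output element per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def merge(image_repr, prefix_repr):
    """Merge by concatenation (an assumption; other merge operations are possible)."""
    return image_repr + prefix_repr


rng = random.Random(0)

IMAGE_FEATURES = 4096  # size of the preprocessed image feature vector (from the quote)
IMAGE_REPR = 256       # size after the dense layer (from the quote)
PREFIX_REPR = 256      # assumed size of the LSTM output for the caption prefix

weights, bias = init_dense(IMAGE_FEATURES, IMAGE_REPR, rng)

# Stand-in for real image features; values are random placeholders.
image_features = [rng.random() for _ in range(IMAGE_FEATURES)]
image_repr = dense(image_features, weights, bias)

# Stand-in for the LSTM's encoding of the caption prefix.
prefix_repr = [0.0] * PREFIX_REPR

# In a merge architecture the image is combined with the prefix encoding
# after the recurrent layer, rather than being fed into it.
merged = merge(image_repr, prefix_repr)
print(len(image_repr), len(merged))
```

In a real system the merged vector would feed a softmax layer over the vocabulary, and the weights would be learned rather than sampled.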