“…Most existing work obtain the best results by combining the penultimate layer of the CNN (via concatenation, summation, etc.) with the final state of the source sentence representation and using it to initialize the target RNN (Caglayan et al, 2016;Calixto et al, 2016;Huang et al, 2016). Recent work also explores an attention mechanism where they use lower level CNN features of the images, such as a convolutional layer, and condition the source and the target sentences on the image features (Calixto et al, 2016;Calixto et al, 2017).…”