Proceedings of the 12th International Conference on Natural Language Generation 2019
DOI: 10.18653/v1/w19-8668

What goes into a word: generating image descriptions with top-down spatial knowledge

Abstract: Generating grounded image descriptions requires associating linguistic units with their corresponding visual clues. A common method is to train a decoder language model with an attention mechanism over convolutional visual features. Attention weights align the stratified visual features arranged by their location with tokens, most commonly words, in the target description. However, words such as spatial relations (e.g. next to and under) are not directly referring to geometric arrangements of pixels but to comple…
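The abstract describes the standard setup the paper builds on: a decoder language model attends over location-indexed convolutional features, and the attention weights align grid locations with the words being generated. Below is a minimal PyTorch sketch of that kind of additive soft attention; the module name, dimensions, and scoring scheme are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Additive soft attention over a flattened grid of conv features.
    (Illustrative sketch; all dimensions are assumed, not from the paper.)"""
    def __init__(self, feat_dim=512, hidden_dim=512, attn_dim=256):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)     # project each grid feature
        self.state_proj = nn.Linear(hidden_dim, attn_dim)  # project decoder state
        self.score = nn.Linear(attn_dim, 1)                # scalar score per location

    def forward(self, feats, h):
        # feats: (batch, L, feat_dim), L = H*W flattened spatial locations
        # h:     (batch, hidden_dim), current decoder hidden state
        e = self.score(torch.tanh(self.feat_proj(feats)
                                  + self.state_proj(h).unsqueeze(1)))  # (batch, L, 1)
        alpha = F.softmax(e.squeeze(-1), dim=1)             # weights over locations
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)  # weighted visual context
        return context, alpha
```

At each decoding step the context vector is fed to the decoder together with the previous word embedding, and alpha provides the word-to-location alignment the abstract refers to.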

Cited by 16 publications (10 citation statements). References 34 publications.
“…For instance, relations such as “the leg is next to the ear” or “the banana is in front of the nose” will not be learned by CNNs because of the small size of the convolutional kernel, while objects and their features (“leg”, “ear”, “banana”, “nose”) will be detected (Kelleher and Dobnik, 2017; Ghanimifard and Dobnik, 2018). Thus, one type of knowledge that is encoded by attention weights is the knowledge of long-distance visual dependencies between objects (see, for example, the study by Ghanimifard and Dobnik (2019)).…”
Section: Methods
confidence: 99%
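The point made in this quote is easy to see concretely: a small convolutional kernel only mixes neighbouring cells of the feature grid, whereas a single attention vector can place weight on arbitrarily distant locations. A toy sketch follows; all coordinates and grid sizes are assumed for illustration.

```python
import numpy as np

H = W = 14                    # hypothetical 14x14 grid of conv features
leg, ear = (1, 1), (12, 12)   # two distant object locations (made up)

# A 3x3 kernel centred on `leg` only sees cells at Chebyshev distance <= 1,
# so it can never relate `leg` to `ear` in a single step:
within_kernel = max(abs(leg[0] - ear[0]), abs(leg[1] - ear[1])) <= 1
print(within_kernel)          # False

# Attention scores every location, so one weight vector can tie both
# cells together regardless of their distance:
alpha = np.zeros(H * W)
alpha[leg[0] * W + leg[1]] = 0.5
alpha[ear[0] * W + ear[1]] = 0.5   # long-distance dependency captured directly
```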
“…Regier (1996) designed neural networks to learn the meanings of spatial prepositions. Ghanimifard and Dobnik (2019) explored the effects of spatial knowledge in a generative neural language model for image description. We mainly work on incorporating spatial semantics into a neural navigation agent.…”
Section: Related Work
confidence: 99%
“…Their model does not decode language into 2D spatial arrangements while reasoning about their position. Finally, Ghanimifard and Dobnik (2019) generate spatial image descriptions to investigate what kind of bottom-up spatial knowledge benefits the top-down methods the most.…”
Section: Related Work
confidence: 99%