Proceedings of the 5th Workshop on Vision and Language 2016
DOI: 10.18653/v1/w16-3202
Combining Lexical and Spatial Knowledge to Predict Spatial Relations between Objects in Images

Abstract: Explicit representations of images are useful for linguistic applications related to images. We design a representation based on first-order models that capture the objects present in an image as well as their spatial relations. We take a supervised learning approach to the spatial relation classification problem and study the effects of spatial and lexical information on prediction performance. We find that lexical information is required to accurately predict spatial relations when combined with location inf…

Cited by 9 publications (7 citation statements)
References 16 publications
“…In logic-based approaches to semantic representations, FOL structures (also called FOL models) are used to represent semantic information in images (Hürlimann and Bos, 2016). An FOL structure is a pair (D, I), where D is a domain (also called a universe) consisting of all the entities in an image, and I is an interpretation function that maps a 1-place predicate to a set of entities, a 2-place predicate to a set of pairs of entities, and so on; for instance, we write I(man) = {d₁} if the entity d₁ is a man, and I(next to) = {(d₁, d₂)} if d₁ is next to d₂. FOL structures have a clear correspondence with graph representations of images in that both capture the categories, attributes, and relations holding of the entities in an image.…”
Section: FOL Structure
confidence: 99%
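The FOL structure described above can be sketched in code. This is a minimal illustration, not an implementation from the cited papers; the entity names, predicates, and the `satisfies` helper are all hypothetical.

```python
# Sketch of an FOL structure (D, I) for an image.
# D is the domain of entities; I interprets each predicate as a set of
# entities (1-place) or a set of entity tuples (2-place).

# Domain D: all entities detected in the image (illustrative names).
D = {"d1", "d2"}

# Interpretation function I, following the quoted example:
# I(man) = {d1}, I(next_to) = {(d1, d2)}.
I = {
    "man": {"d1"},              # d1 is a man
    "bicycle": {"d2"},          # d2 is a bicycle (assumed for illustration)
    "next_to": {("d1", "d2")},  # d1 is next to d2
}

def satisfies(pred, *args):
    """Check whether the structure (D, I) satisfies pred(args)."""
    key = args[0] if len(args) == 1 else tuple(args)
    return key in I.get(pred, set())

print(satisfies("man", "d1"))            # True
print(satisfies("next_to", "d1", "d2"))  # True
print(satisfies("man", "d2"))            # False
```

Representing I as a mapping from predicate names to extension sets makes the correspondence with scene graphs direct: each 1-place entry is a node label and each 2-place entry is an edge.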
“…We use two datasets: Visual Genome (Krishna et al., 2017), which contains pairs of scene graphs and images, and the GRIM dataset (Hürlimann and Bos, 2016), which annotates an FOL structure of an image together with two types of captions (true and false sentences with respect to the image). Note that our system is fully unsupervised and does not require any training data; in the following, we describe only the test set creation procedure.…”
Section: Dataset
confidence: 99%
“…5 The relation between images and models is implicit in (Young et al., 2014), from which we took inspiration, but it is not further developed there in the way that we are attempting here. Hürlimann and Bos (2016) make an explicit connection between images and models, but only look at denotations, as do Schlangen et al. (2016).…”
Section: Corpora Used Here
confidence: 99%
“…In comparison, the largest "classical" semantics resource, the Groningen Meaning Bank (Bos et al., 2017), provides some 10,000 annotated sentences, and the Parallel Meaning Bank (Abzianidze et al., 2017) another 15,000. There is no competition here, though: the Meaning Bank annotations are obviously much deeper and much more detailed; the proposal in this paper is to view the image corpora discussed here as complementary. 5 The relation between images and models is implicit in (Young et al., 2014), from which we took inspiration, but it is not further developed there in the way that we are attempting here. Hürlimann and Bos (2016) make an explicit connection between images and models, but only look at denotations, as do Schlangen et al. (2016).…”
confidence: 97%
“…We must be able to observe and describe visible objects and the spatial relationships between them. Before addressing paths and navigation tasks, we can make considerable progress by improving our data and modeling for spatial relations in tasks like image segmentation and image captioning (Hall et al., 2011; Hürlimann and Bos, 2016), grounding referential expressions (Kazemzadeh et al., 2014; Mao et al., 2016; Hu et al., 2017), relative positioning of objects (Kitaev and Klein, 2017), and image geolocation (Hays and Efros, 2008; Zamir et al., 2016). We will create collaborative image identification and description tasks that emphasize spatial relations and geographically salient landmarks.…”
Section: Tasks
confidence: 99%