Humans often leverage spatial clues to categorize scenes in a fraction of a second. This form of intelligence is very relevant in time-critical situations (e.g., when driving a car) and valuable to transfer to automated systems. This work investigates the predictive power of solely processing spatial clues for scene understanding in 2D images and compares such an approach with the predictive power of visual appearance. To this end, we design the laboratory task of predicting the identity of two objects (e.g., "man" and "horse") and their relationship or predicate (e.g., "riding") given exclusively the ground truth bounding box coordinates of both objects. We also measure the performance attainable in Human Object Interaction (HOI) detection, a real-world spatial task, which includes a setting where ground truth boxes are not available at test time. An additional goal is to identify the principles necessary to effectively represent a spatial template, that is, the visual region in which two objects involved in a relationship expressed by a predicate occur. We propose a scale-, mirror-, and translation-invariant representation that captures the spatial essence of the relationship, i.e., a canonical spatial representation. Tests in two benchmarks reveal:(1) High performance is attainable by using exclusively spatial information in all tasks. (2) In HOI detection, the canonical template outperforms the rest of spatial, visual, and several state-of-the-art baselines. (3) Simple fusion of visual and spatial features substantially improves performance. (4) Our methods fare remarkably well with a small amount of data and rare categories. Our results obtained on the Visual Genome (VG) and the Humans Interacting with Common Objects -Detection (HICO-DET) datasets indicate that great predictive power can be obtained from spatial clues alone, opening up possibilities for performing fast scene understanding at a glance.INDEX TERMS Spatial understanding, spatial layout, computer vision, vision and scene understanding.
I. INTRODUCTION
1A Well-researched concept in cognitive science is the 2 gist, or the initial representation of a scene obtained 3 in a brief glance. The gist may include semantic content 4 (e.g., "is a classroom"), the identity of a few objects (e.g.,
5"there are books"), and the spatial layout [1]. Humans can 6 categorize scenes in a fraction of a second (∼13-250 ms)
7[1], [2]. Generally, more detailed scenes and finer-grained 8