This paper proposes a graph-based deep framework for detecting anomalous image regions in human monitoring. The most relevant previous methods, which adopt deep models to obtain salient regions with captions, focus on discovering anomalous single regions and anomalous region pairs. However, they cannot detect an anomaly involving more than two regions and fall short in capturing interactions among humans and objects scattered across multiple regions. For instance, the region of a man making a phone call is normal when it is located close to a kitchen sink and a soap bottle, as they are in a resting area, but abnormal when close to a bookshelf and a notebook PC, as they are in a working area. To overcome this limitation, we propose a spatial and semantic attributed graph and develop a Spatial and Semantic Graph Auto-Encoder (SSGAE). Specifically, the proposed graph models the “context” of a region in an image by considering both other regions with spatial relations, e.g., a man sitting on a chair is adjacent to a white desk, and other region captions with high semantic similarities, e.g., “a man in a kitchen” is semantically similar to “a white chair in the kitchen”. In this way, a region and its context are represented by a node and its neighbors, respectively, in the spatial and semantic attributed graph. Subsequently, SSGAE is devised to reconstruct the proposed graph and detect abnormal nodes. Extensive experiments on three real-world datasets show that SSGAE improves the AUC over the best baselines from 0.79 to 0.83, from 0.83 to 0.87, and from 0.91 to 0.93.
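To make the graph construction and reconstruction-based scoring concrete, the following is a minimal, hypothetical sketch in plain NumPy with toy data. It is not the paper's SSGAE architecture: region centers and caption embeddings, the distance and similarity thresholds, and the one-hop neighbor-mean "reconstruction" (a crude stand-in for a trained graph auto-encoder) are all illustrative assumptions. It only shows the idea that regions become nodes, edges link spatially close or semantically similar regions, and a node is flagged when its attributes are poorly reconstructed from its context.

```python
import numpy as np

# Toy data (illustrative, not from the paper): 5 regions, each with a
# 2-D spatial center and a 3-D caption embedding. Regions 0-1 form one
# area, regions 2-4 another; region 3 carries a caption that matches
# the *other* area, i.e., a spatial/semantic mismatch.
centers = np.array([[0.10, 0.10], [0.20, 0.15],
                    [0.90, 0.90], [0.85, 0.80], [0.95, 0.85]])
captions = np.array([[1.0, 0.0, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 1.0, 0.0],
                     [1.0, 0.0, 0.1],   # semantically out of place here
                     [0.1, 0.9, 0.0]])

def build_graph(centers, captions, d_max=0.3, s_min=0.8):
    """Adjacency: connect regions that are spatially close OR whose
    caption embeddings have high cosine similarity (thresholds assumed)."""
    n = len(centers)
    norms = np.linalg.norm(captions, axis=1)
    sims = captions @ captions.T / (norms[:, None] * norms[None, :])
    A = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            spatial = np.linalg.norm(centers[i] - centers[j]) < d_max
            semantic = sims[i, j] > s_min
            if spatial or semantic:
                A[i, j] = 1.0
    return A

def anomaly_scores(A, X):
    """Reconstruct each node's attributes as the mean of its neighbors'
    attributes; the reconstruction error serves as the anomaly score."""
    deg = A.sum(axis=1, keepdims=True)
    recon = (A @ X) / np.maximum(deg, 1.0)
    return np.linalg.norm(X - recon, axis=1)

A = build_graph(centers, captions)
scores = anomaly_scores(A, captions)
print(scores)  # the mismatched region and its context score highest
```

In this toy run, the regions involved in the spatial/semantic mismatch (3 and its immediate neighborhood) receive larger reconstruction errors than the coherent regions 0 and 1, mirroring how SSGAE flags abnormal nodes whose attributes the auto-encoder cannot reconstruct from their context.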