A scene graph provides a powerful intermediate knowledge structure for various visual tasks, including semantic image retrieval, image captioning, and visual question answering. In this paper, the task of predicting a scene graph for an image is formulated as two connected problems, ie, recognizing the relationship triplets, structured as , and constructing the scene graph from the recognized relationship triplets. For relationship triplet recognition, we develop a novel hierarchical recurrent neural network with visual attention mechanism. This model is composed of two attention‐based recurrent neural networks in a hierarchical organization. The first network generates a topic vector for each relationship triplet, whereas the second network predicts each word in that relationship triplet given the topic vector. This approach successfully captures the compositional structure and contextual dependency of an image and the relationship triplets describing its scene. For scene graph construction, an entity localization approach to determine the graph structure is presented with the assistance of available attention information. Then, the procedures for automatically converting the generated relationship triplets into a scene graph are clarified through an algorithm. Extensive experimental results on two widely used data sets verify the feasibility of the proposed approach.