Semantic extraction from images is a pressing problem with applications in many semantic retrieval systems. In this paper, a semantic-based image retrieval (SBIR) system is proposed that combines the growth partitioning tree (GP-Tree), built in the authors' previous work, with a self-organizing map (SOM) network and a neighbor graph; the combined structure, called SgGP-Tree, is intended to improve retrieval accuracy. For each query image, a set of similar images is retrieved from the SgGP-Tree, and a set of visual words is extracted from the classes produced by a mask region-based convolutional neural network (Mask R-CNN). These visual words form the basis for querying the semantics of the input image on an ontology using a SPARQL Protocol and RDF Query Language (SPARQL) query. Experiments were performed on the ImageCLEF and MS-COCO image datasets, yielding precision values of 0.898453 and 0.875467, respectively. These results are compared with related works on the same datasets, showing the effectiveness of the proposed method.
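
As an illustration of the final step only, the following minimal sketch shows how a set of visual words (for example, class labels returned by a detector) might be turned into a SPARQL query string. It is not the authors' implementation: the namespace `http://example.org/sbir#`, the property `sbir:hasVisualWord`, and the function `build_sparql_query` are hypothetical names introduced here, and the ontology is assumed to store visual words as plain string literals.

```python
from typing import Iterable

# Hypothetical namespace: the actual ontology schema is not given in the text.
ONTOLOGY_PREFIX = "http://example.org/sbir#"


def build_sparql_query(visual_words: Iterable[str], limit: int = 10) -> str:
    """Build a SPARQL SELECT query that looks up images by visual word."""
    # Normalize, deduplicate, and sort the words so the query is deterministic.
    words = sorted({w.strip().lower() for w in visual_words if w.strip()})
    if not words:
        raise ValueError("at least one visual word is required")

    # Escape backslashes and double quotes so each word is a valid literal.
    literals = " ".join(
        '"' + w.replace("\\", "\\\\").replace('"', '\\"') + '"' for w in words
    )

    # sbir:hasVisualWord is an assumed property name, used for illustration.
    return (
        f"PREFIX sbir: <{ONTOLOGY_PREFIX}>\n"
        "SELECT DISTINCT ?image ?label WHERE {\n"
        f"  VALUES ?label {{ {literals} }}\n"
        "  ?image sbir:hasVisualWord ?label .\n"
        "}\n"
        f"LIMIT {limit}"
    )


if __name__ == "__main__":
    # Example labels such as a detector might return for one query image.
    print(build_sparql_query(["Dog", "person", "dog"]))
```

The `VALUES` clause restricts `?label` to the extracted words, so a single query covers all visual words of one image; the resulting string could then be sent to whatever SPARQL endpoint hosts the ontology.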