As the applications of robots expand across a wide variety of areas, high-level task planning considering human–robot interactions is emerging as a critical issue. Various elements that facilitate flexible responses to humans in an ever-changing environment, such as scene understanding, natural language processing, and task planning, are thus being researched extensively. In this study, a visual question answering (VQA) task was examined in detail from among an array of technologies. By further developing conventional neuro-symbolic approaches, environmental information is stored and utilized in a symmetric graph format, which enables more flexible and complex high-level task planning. We construct a symmetric graph composed of information such as color, size, and position for the objects constituting the environmental scene. VQA, using graphs, largely consists of a part expressing a scene as a graph, a part converting a question into SPARQL, and a part reasoning the answer. The proposed method was verified using a public dataset, CLEVR, with which it successfully performed VQA. We were able to directly confirm the process of inferring answers using SPARQL queries converted from the original queries and environmental symmetric graph information, which is distinct from existing methods that make it difficult to trace the path to finding answers.