Scene parsing requires a model to process complex multi-modal data, such as images and their context in real scenes, and to discover the implicit connections among the objects in a scene. Because a knowledge graph stores both entities and the relationships between them, it is well suited to expressing the objects in a scene and their semantic relationships. In this paper, we propose a new multi-phase process for scene parsing tasks: first, a knowledge graph aligns the multi-modal information, and then a graph-based model generates the results. We also designed a feature-engineering validation experiment for a deep-learning model to preliminarily verify the effectiveness of this process. Building on this, we propose a knowledge representation method named Entity Descriptor Encoder of Transformer (EDET), which represents knowledge using both the entity itself and its internal attributes. EDET can be embedded into the transformer structure to solve multi-modal scene parsing tasks. It aggregates the multi-modal attributes of entities, and its results on scene graph generation and image captioning tasks show that it performs strongly in multi-modal settings. Finally, we applied the proposed method to an industrial scene, which confirmed its viability.
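To make the idea of representing an entity by "the entity itself and its internal attributes" more concrete, the following is a minimal sketch of one possible aggregation step. It is not the EDET implementation: the abstract does not specify how attributes are combined, so the scaled dot-product attention pooling, the residual sum, and the names `aggregate_entity`, `entity_vec`, and `attribute_vecs` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate_entity(entity_vec, attribute_vecs):
    """Hypothetical aggregation: the entity embedding attends over its
    attribute embeddings, and the pooled result is added back to the
    entity embedding. This is an assumed mechanism, not the paper's."""
    if not attribute_vecs:
        # No attributes: the representation is the entity embedding alone.
        return list(entity_vec)
    scale = math.sqrt(len(entity_vec))
    # Attention weights: similarity of the entity to each attribute.
    weights = softmax([dot(entity_vec, a) / scale for a in attribute_vecs])
    # Weighted average of the attribute embeddings, per dimension.
    pooled = [
        sum(w * a[i] for w, a in zip(weights, attribute_vecs))
        for i in range(len(entity_vec))
    ]
    # Residual combination of entity and pooled attribute information.
    return [e + p for e, p in zip(entity_vec, pooled)]
```

In this sketch, attributes from different modalities (for example, a visual feature and a text feature) would first need to be projected into the same vector space as the entity embedding; that projection is omitted here.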