Visual navigation requires an agent to locate a given target using visual perception. To enable robots to execute tasks effectively, it is necessary to combine large language models (LLMs) with multi-modal inputs for navigation. While LLMs offer rich semantic knowledge, they lack specific real-world information and real-time interaction capabilities. This paper introduces a Multi-modal Scene Graph (MMSG) navigation framework that aligns LLMs with visual perception models to predict the next steps. First, a multi-modal scene dataset is constructed, containing object-relation-target triplets. We provide target words and lists of objects present in the scene to GPT-3.5 to generate a large number of instructions and corresponding action plans. The generated data is then used to pre-train the LLM for path planning. During inference, we discover objects in the scene by extending the DETR visual object detector to multi-view RGB images collected from different reachable positions. Experimental results show that the path plans generated by MMSG outperform those of state-of-the-art methods, indicating its feasibility in complex environments. We evaluate our method on the ProcTHOR dataset and show superior navigation performance.
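To make the inference step concrete, the sketch below shows one way the multi-view object discovery could look using an off-the-shelf DETR checkpoint from Hugging Face Transformers: each RGB view captured from a reachable position is passed through the detector, and the detected labels are merged into a single scene object list for the LLM planner. The checkpoint name, confidence threshold, and union-by-label aggregation are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions noted above): run DETR on RGB views collected
# from different reachable positions and merge detections into one object list.
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50")
model.eval()

def detect_objects(image: Image.Image, threshold: float = 0.7) -> set[str]:
    """Return the set of object labels DETR finds in a single RGB view."""
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    results = processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold
    )[0]
    return {model.config.id2label[label.item()] for label in results["labels"]}

def discover_scene_objects(view_paths: list[str]) -> list[str]:
    """Union detections over multi-view RGB images of the same scene."""
    objects: set[str] = set()
    for path in view_paths:
        objects |= detect_objects(Image.open(path).convert("RGB"))
    return sorted(objects)

# The aggregated object list, together with the target word, would then be
# given to the pre-trained LLM planner to produce the next action plan, e.g.:
# scene_objects = discover_scene_objects(["view_0.png", "view_1.png", "view_2.png"])
```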