3D question answering (3D-QA) aims to answer free-form natural language questions about 3D scenes represented by point clouds. Compared to traditional 2D-QA, 3D-QA poses a dual challenge: models must understand both the appearance and structure of objects and their spatial relationships. In this work, we introduce a novel method, named M2AD, that leverages multi-modal data to enhance the representation of 3D scene point clouds during training. Specifically, we incorporate 2D features corresponding to the 3D objects and captions describing the scene into the 3D object proposal stage, endowing the model with stronger representation abilities. Furthermore, to keep inference self-contained without the need for additional data, we adopt a teacher-student framework that distills the knowledge of the enhanced model into a student that uses only point cloud data. Extensive experiments substantiate the effectiveness of the proposed method.
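
To make the teacher-student setup concrete, the following PyTorch sketch illustrates one possible form of the training step described above: a teacher proposal encoder that fuses 2D and caption features with point-cloud proposal features, and a point-cloud-only student trained with a QA loss plus a feature-alignment distillation term. This is a minimal sketch under stated assumptions; the module names, feature dimensions, pooling, and the loss weight `alpha` are illustrative placeholders, not the actual M2AD implementation.

```python
# Hedged sketch of multi-modal teacher -> point-cloud-only student distillation.
# All names, dimensions, and loss weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalEncoder(nn.Module):
    """Toy stand-in for a 3D object proposal encoder; the teacher variant can fuse extra modalities."""

    def __init__(self, in_dim, hid_dim, use_extra=False, extra_dim=0):
        super().__init__()
        self.use_extra = use_extra
        fuse_dim = in_dim + (extra_dim if use_extra else 0)
        self.mlp = nn.Sequential(
            nn.Linear(fuse_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim)
        )

    def forward(self, proposal_feats, extra_feats=None):
        # proposal_feats: (B, K, in_dim) per-proposal point-cloud features
        # extra_feats:    (B, K, extra_dim) 2D/caption features (teacher only)
        if self.use_extra and extra_feats is not None:
            proposal_feats = torch.cat([proposal_feats, extra_feats], dim=-1)
        return self.mlp(proposal_feats)  # (B, K, hid_dim)


def distillation_step(teacher, student, qa_head, batch, alpha=1.0):
    """QA loss on the student plus feature distillation from the frozen multi-modal teacher."""
    with torch.no_grad():
        t_feats = teacher(batch["proposals"], batch["extra"])  # teacher sees 2D + caption features
    s_feats = student(batch["proposals"])                      # student sees point clouds only

    logits = qa_head(s_feats.mean(dim=1))                      # pooled proposals -> answer logits
    qa_loss = F.cross_entropy(logits, batch["answer"])
    distill_loss = F.mse_loss(s_feats, t_feats)                # align student with teacher features
    return qa_loss + alpha * distill_loss


if __name__ == "__main__":
    B, K, D, E, H, A = 2, 8, 128, 64, 256, 100  # batch, proposals, feature dims, answer classes
    teacher = ProposalEncoder(D, H, use_extra=True, extra_dim=E).eval()
    student = ProposalEncoder(D, H)
    qa_head = nn.Linear(H, A)
    batch = {
        "proposals": torch.randn(B, K, D),
        "extra": torch.randn(B, K, E),           # multi-modal features, used only during training
        "answer": torch.randint(0, A, (B,)),
    }
    loss = distillation_step(teacher, student, qa_head, batch)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

At inference time only the student and QA head would be used, so no 2D images or captions are required, which reflects the self-contained inference described above.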