3D question answering (3D-QA) aims to answer free-form natural language questions about 3D scenes represented by point clouds. Compared to traditional 2D-QA, 3D-QA poses a dual challenge: models must understand both the appearance and structure of objects and their spatial relationships. In this work, we introduce a novel method, named M2AD, that leverages multi-modal data to enhance the representation of 3D scene point clouds during training. Specifically, we incorporate 2D features corresponding to the 3D objects and captions describing the scene into the 3D object proposal stage, endowing the model with stronger representation abilities. Furthermore, to keep inference self-contained without the need for additional data, we adopt a teacher-student framework that distills the knowledge of the enhanced model into a student that uses only point cloud data. Extensive experiments substantiate the effectiveness of the proposed method.
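
To make the teacher-student setup concrete, the following PyTorch sketch illustrates one possible form of the training step described above: a teacher proposal encoder that fuses 2D and caption features with point-cloud proposal features, and a point-cloud-only student trained with a QA loss plus a feature-alignment distillation term. This is a minimal sketch under stated assumptions; the module names, feature dimensions, pooling, and the loss weight `alpha` are illustrative placeholders, not the actual M2AD implementation.

```python
# Hedged sketch of multi-modal teacher -> point-cloud-only student distillation.
# All names, dimensions, and loss weights are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProposalEncoder(nn.Module):
    """Toy stand-in for a 3D object proposal encoder; the teacher variant can fuse extra modalities."""

    def __init__(self, in_dim, hid_dim, use_extra=False, extra_dim=0):
        super().__init__()
        self.use_extra = use_extra
        fuse_dim = in_dim + (extra_dim if use_extra else 0)
        self.mlp = nn.Sequential(
            nn.Linear(fuse_dim, hid_dim), nn.ReLU(), nn.Linear(hid_dim, hid_dim)
        )

    def forward(self, proposal_feats, extra_feats=None):
        # proposal_feats: (B, K, in_dim) per-proposal point-cloud features
        # extra_feats:    (B, K, extra_dim) 2D/caption features (teacher only)
        if self.use_extra and extra_feats is not None:
            proposal_feats = torch.cat([proposal_feats, extra_feats], dim=-1)
        return self.mlp(proposal_feats)  # (B, K, hid_dim)


def distillation_step(teacher, student, qa_head, batch, alpha=1.0):
    """QA loss on the student plus feature distillation from the frozen multi-modal teacher."""
    with torch.no_grad():
        t_feats = teacher(batch["proposals"], batch["extra"])  # teacher sees 2D + caption features
    s_feats = student(batch["proposals"])                      # student sees point clouds only

    logits = qa_head(s_feats.mean(dim=1))                      # pooled proposals -> answer logits
    qa_loss = F.cross_entropy(logits, batch["answer"])
    distill_loss = F.mse_loss(s_feats, t_feats)                # align student with teacher features
    return qa_loss + alpha * distill_loss


if __name__ == "__main__":
    B, K, D, E, H, A = 2, 8, 128, 64, 256, 100  # batch, proposals, feature dims, answer classes
    teacher = ProposalEncoder(D, H, use_extra=True, extra_dim=E).eval()
    student = ProposalEncoder(D, H)
    qa_head = nn.Linear(H, A)
    batch = {
        "proposals": torch.randn(B, K, D),
        "extra": torch.randn(B, K, E),           # multi-modal features, used only during training
        "answer": torch.randint(0, A, (B,)),
    }
    loss = distillation_step(teacher, student, qa_head, batch)
    loss.backward()
    print(f"loss = {loss.item():.4f}")
```

At inference time only the student and QA head would be used, so no 2D images or captions are required, which reflects the self-contained inference described above.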