ReferIt3D: Neural Listeners for Fine-Grained 3D Object Identification in Real-World Scenes

Achlioptas, Panos; Abdelreheem, Ahmed; Xia, Fei; Elhoseiny, Mohamed; Guibas, Leonidas J.

doi:10.1007/978-3-030-58452-8_25

Cited by 131 publications

(176 citation statements)

References 52 publications

Supporting

Mentioning

175

Contrasting

“…To avoid such pitfalls, algorithms and techniques need to be developed for processing 3D inputs such as RGB-D, meshes, and point clouds in conjunction with language. Some pioneering works have already begun in this direction (Achlioptas et al, 2020; Liu et al, 2021; Roh et al, 2021) and we anticipate the trend to shift more towards developing algorithms for understanding as well as the generation of 3D scenes (Briq et al, 2021), while utilizing language as a main or auxiliary modality.…”

Section: Future Directions (mentioning)

confidence: 99%

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods

Mogadala

Kalimuthu

Klakow

2021

JAIR

Interest in Artificial Intelligence (AI) and its applications has seen unprecedented growth in the last few years. This success can be partly attributed to the advancements made in the sub-fields of AI such as machine learning, computer vision, and natural language processing. Much of the growth in these fields has been made possible with deep learning, a sub-area of machine learning that uses artificial neural networks. This has created significant interest in the integration of vision and language. In this survey, we focus on ten prominent tasks that integrate language and vision by discussing their problem formulation, methods, existing datasets, evaluation measures, and compare the results obtained with corresponding state-of-the-art methods. Our efforts go beyond earlier surveys which are either task-specific or concentrate only on one type of visual content, i.e., image or video. Furthermore, we also provide some potential future directions in this field of research with an anticipation that this survey stimulates innovative thoughts and ideas to address the existing challenges and build new applications.

“…Existing works focus on using language to confine individual objects, e.g., detecting referred 3D objects [7] or distinguishing objects according to language phrases [2]. Recently, ScanRefer [6] and ReferIt3D [1] introduce a task of localizing objects within a 3D scene given the linguistic descriptions, namely 3D visual grounding. Following them, several works are proposed to improve the performance through instance segmentation [14,46], or Transformer [33,44,49].…”

Section: Scene Graph Normalization (mentioning)

confidence: 99%

“…ReferIt3D [1]: It is initially a model for the 3D visual grounding task. The network first extracts object features through PointNet++ [31].…”

Section: VQA-3D (mentioning)

confidence: 99%

“…TransVQA3D achieves the best result almost in all question categories, significantly surpassing the pure language model. It should be noted that utilizing the feature fusion in ReferIt3D [1] cannot effectively improve the VQA-3D performance. Without the well-designed architecture and scene graph aware feature enhancement, the performance on 3D scene understanding will be degraded.…”

Section: VQA-3D (mentioning)

confidence: 99%

“…These works pay more attention to each individual object but ignore the interobject's context and relationships, only utilizing them to improve the per-object recognition. Recently, applying natural language to cooperatively improve the scene understanding has become a hot research topic, where 3D visual grounding [1,6], 3D dense captioning [8] and scene graph analysis [40,42,48] are increasingly studied. Compared with the reasoning from 2D images, reasoning in real-world 3D scenes can avoid the inherent spatial ambiguity in 2D data, and capture the real geometric information and inter-object relationships.…”

Section: Introduction (mentioning)

confidence: 99%

CLEVR3D: Compositional Language and Elementary Visual Reasoning for Question Answering in 3D Real-World Scenes

Xu

Yuan

Du

et al.

2021

Preprint

3D scene understanding is a relatively emerging research field. In this paper, we introduce the Visual Question Answering task in 3D real-world scenes (VQA-3D), which aims to answer all possible questions given a 3D scene. To tackle this problem, the first VQA-3D dataset, namely CLEVR3D, is proposed, which contains 60K questions in 1,129 real-world scenes. Specifically, we develop a question engine leveraging 3D scene graph structures to generate diverse reasoning questions, covering the questions of objects' attributes (i.e., size, color, and material) and their spatial relationships. Built upon this dataset, we further design the first VQA-3D baseline model, TransVQA3D. The TransVQA3D model adopts well-designed Transformer architectures to achieve superior VQA-3D performance, compared with the pure language baseline and previous 3D reasoning methods directly applied to 3D scenarios. Experimental results verify that taking VQA-3D as an auxiliary task can boost the performance of 3D scene understanding, including scene graph analysis for the node-wise classification and whole-graph recognition.

Bioinspired Interfacial Materials and Devices at the School of Chemistry at Beihang University

Cheng

Jiang

2018

Adv Funct Materials

This paper aims to experimentally and numerically probe fatigue behaviours and lifetimes of novel GLARE (glass laminate aluminium reinforced epoxy) laminates under random loading spectrum. A mixed algorithm based on fatigue damage concepts of three-phase materials was proposed for modelling progressive fatigue damage mechanisms and fatigue life of fibre metal laminates (FML) under random loading spectrum. To validate the proposed modelling algorithm, fatigue tests were conducted on the GLARE 2/1 and GLARE 3/2 laminates subjected to random loading spectrum, and fatigue mechanisms were discussed by using scanning electron microscope (SEM) analysis. It is shown that predominant fatigue failure of the GLARE laminate depends on the reference load level of random loading spectrum. Specifically, dominant fatigue failure of the GLARE laminate is dependent on fatigue strength of fibre layer at a high reference load level, but metal layer at a low reference load level. Numerical predictions agree well with experimental results, demonstrating that the proposed mixed modelling algorithm can effectively simulate fatigue behaviours and lives of the GLARE laminate under random loading spectrum.
