2D image-based 3D shape retrieval (2D-to-3D) investigates the problem of retrieving relevant 3D shapes from a gallery dataset given a query image. Recently, adversarial training and environmental style transfer learning have been successfully applied to this task and have achieved state-of-the-art performance. However, two problems remain. First, previous works concentrate only on the connection between the label and the representation, paying less attention to the unique visual characteristics of each instance. Second, the confused features or the transformed images can only fool the discriminator but cannot guarantee semantic consistency; in other words, the features of a 2D desk image may be mapped near the features of a 3D chair. In this paper, we propose a novel semantic consistency guided instance feature alignment network (SC-IFA) to address these limitations. SC-IFA mainly consists of two parts: instance visual feature extraction and cross-domain instance feature adaptation. For the first module, unlike previous methods that merely employ a 2D CNN to extract features, we additionally maximize the mutual information between the input and its feature to enhance the representation capability for each instance. For the second module, we first introduce the margin disparity discrepancy model to mix up the cross-domain features in an adversarial training manner. Then, we design two feature translators to transform features from one domain to the other, and impose a translation loss and a correlation loss on the transformed features to preserve semantic consistency. Extensive experimental results on two benchmarks, MI3DOR and MI3DOR-2, verify that SC-IFA is superior to state-of-the-art methods.
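The translator-based consistency constraint can be illustrated with a short sketch. The snippet below is a minimal illustration and not the authors' implementation: two hypothetical feature translators map features between the 2D-image and 3D-shape domains, with an L2 translation loss and a cosine-similarity-based correlation loss standing in for the losses described above. The names (FeatureTranslator, feat_2d, feat_3d) and the paired same-class batches are assumptions made purely for illustration.

```python
# Minimal sketch (not the authors' code): hypothetical feature translators
# between the 2D-image and 3D-shape feature domains, with a translation loss
# and a correlation-style consistency loss. All names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureTranslator(nn.Module):
    """Maps a feature vector from one domain to the other."""
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim)
        )

    def forward(self, x):
        return self.net(x)

def correlation_loss(a, b):
    # Push translated and target features toward high correlation
    # (cosine similarity near 1); one plausible reading of a "correlation
    # loss", not necessarily the paper's exact formulation.
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean()

t_2d_to_3d = FeatureTranslator()
t_3d_to_2d = FeatureTranslator()

feat_2d = torch.randn(8, 512)   # features of 2D query images (placeholder)
feat_3d = torch.randn(8, 512)   # features of 3D gallery shapes (placeholder)

# Translation loss: features mapped into the other domain should match
# same-class target features (assumed to be paired here for illustration).
loss_trans = F.mse_loss(t_2d_to_3d(feat_2d), feat_3d) + \
             F.mse_loss(t_3d_to_2d(feat_3d), feat_2d)
loss_corr = correlation_loss(t_2d_to_3d(feat_2d), feat_3d) + \
            correlation_loss(t_3d_to_2d(feat_3d), feat_2d)
loss = loss_trans + loss_corr
```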
The rapid development of 3D techniques has led to a dramatic increase in 3D data. Scalable and effective 3D object retrieval and classification algorithms have become essential for large-scale 3D object management. One critical problem in view-based 3D object retrieval and classification is how to exploit the relevance and discrimination among multiple views. In this paper, we propose a multi-view hierarchical fusion network (MVHFN) for these two tasks. This method mainly contains two key modules. First, the visual feature learning module applies 2D CNNs to extract visual features from multiple views rendered around a specific 3D object. Then, the proposed multi-view hierarchical fusion module is employed to fuse the multiple view features into a compact descriptor. This module not only fully exploits the relevance among multiple views via an intra-cluster multi-view fusion mechanism, but also discovers content discrimination via an inter-cluster multi-view fusion mechanism. Experimental results on two public datasets, i.e., ModelNet40 and ShapeNetCore55, show that the proposed MVHFN outperforms current state-of-the-art methods in both the 3D object retrieval and classification tasks.

Index Terms: 3D object retrieval, 3D object classification, 3D shape recognition, multi-view.
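A rough sketch of the hierarchical fusion idea is given below, under assumptions not taken from the paper: per-view features from a 2D CNN are split into fixed, evenly sized view clusters, fused within each cluster by a learned attention weighting (intra-cluster), and the resulting cluster descriptors are fused across clusters (inter-cluster) into one compact shape descriptor. The class name, the attention form, and the even cluster assignment are illustrative choices only.

```python
# Minimal sketch (assumptions, not the paper's implementation) of
# intra-cluster followed by inter-cluster multi-view fusion.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MultiViewHierarchicalFusion(nn.Module):
    def __init__(self, num_clusters=3, feat_dim=512):
        super().__init__()
        backbone = resnet18(weights=None)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # per-view features
        self.num_clusters = num_clusters
        self.intra_attn = nn.Linear(feat_dim, 1)   # weights views inside a cluster
        self.inter_attn = nn.Linear(feat_dim, 1)   # weights cluster descriptors

    def forward(self, views):                      # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        f = self.cnn(views.reshape(b * v, c, h, w)).reshape(b, v, -1)  # (B, V, D)
        clusters = f.chunk(self.num_clusters, dim=1)   # even split over views (illustrative)
        descs = []
        for g in clusters:                         # intra-cluster fusion
            w_intra = torch.softmax(self.intra_attn(g), dim=1)
            descs.append((w_intra * g).sum(dim=1))
        d = torch.stack(descs, dim=1)              # (B, K, D) cluster descriptors
        w_inter = torch.softmax(self.inter_attn(d), dim=1)
        return (w_inter * d).sum(dim=1)            # (B, D) compact shape descriptor

# Usage example: 12 rendered views per object, batch of 2.
shape_views = torch.randn(2, 12, 3, 224, 224)
descriptor = MultiViewHierarchicalFusion()(shape_views)
```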