TMVNet : Using Transformers for Multi-view Voxel-based 3D Reconstruction

Peng, Kebin; Islam, Rifatul; Quarles, John; Desai, Kevin

doi:10.1109/cvprw56347.2022.00036

Cited by 18 publications

(17 citation statements)

References 23 publications


“…Based on the network, they proposed an RNN-based model to gain the 3D corresponding representation from the input image. TMVNet [18] applied the transformers to the encoder and proposed a 3D feature fusion layer to refine the predictions. Kniaz et al. [19] proposed an image-to-voxel translation model which applied a generative adversarial network.…”

Section: Related Work (mentioning)

confidence: 99%

A Single Stage and Single View 3D Point Cloud Reconstruction Network Based on DetNet

Zhu

2022

Sensors


It is a challenging problem to infer objects with reasonable shapes and appearance from a single picture. Existing research often pays more attention to the structure of the point cloud generation network, while ignoring the feature extraction of 2D images and reducing the loss in the process of feature propagation in the network. In this paper, a single-stage and single-view 3D point cloud reconstruction network, 3D-SSRecNet, is proposed. The proposed 3D-SSRecNet is a simple single-stage network composed of a 2D image feature extraction network and a point cloud prediction network. The single-stage network structure can reduce the loss of the extracted 2D image features. The 2D image feature extraction network takes DetNet as the backbone. DetNet can extract more details from 2D images. In order to generate point clouds with better shape and appearance, in the point cloud prediction network, the exponential linear unit (ELU) is used as the activation function, and the joint function of chamfer distance (CD) and Earth mover’s distance (EMD) is used as the loss function of 3D-SSRecNet. In order to verify the effectiveness of 3D-SSRecNet, we conducted a series of experiments on ShapeNet and Pix3D datasets. The experimental results measured by CD and EMD have shown that 3D-SSRecNet outperforms the state-of-the-art reconstruction methods.
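The chamfer distance named in this abstract can be illustrated with a minimal sketch. It assumes one common definition (symmetric, squared Euclidean, averaged over each point set); the abstract does not say which variant 3D-SSRecNet uses, and the EMD term of the joint loss is left out. The function name `chamfer_distance` is hypothetical.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric chamfer distance between two non-empty point sets.

    Assumption: squared Euclidean distances, averaged per set.
    Other variants (unsquared, summed) exist; this is only one of them.
    """
    def sq_dist(p, q):
        # Squared Euclidean distance between two points of equal dimension.
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_sided(src, dst):
        # Mean over src of the squared distance to the nearest point in dst.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


if __name__ == "__main__":
    predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
    print(chamfer_distance(predicted, reference))
```

This brute-force version compares every pair of points, so it is only suitable for small point clouds; a training loss would normally use a batched tensor implementation.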


“…The reconstruction is decoded from the weighted sum of latent codes. Transformer models incorporating self-attention have also been proposed for 3D reconstruction [10][11][12]. None of the attention-based methods supports iterative updating of a previous reconstruction, since these architectures expect to receive all input images at once.…”

Section: Related Work (mentioning)

confidence: 99%

“…AttSets [7] (2020): 0.685; Pix2Vox++/F [9] (2020): 0.696; Pix2Vox++/A [9] (2020): 0.715; EVolT [10] (2021): 0.698; TMVNet [12] (2022): 0.719; 3D-R2N2 [6] (2016): 0.635; Ours: 0.690…”

Section: IoU Iterative (mentioning)

confidence: 99%

“…Reconstruction algorithms have been presented in the literature that operate on various sensor modalities, including LIDAR [3], RADAR [4], RGB-D [5] and monocular (RGB) cameras [6][7][8][9][10][11][12][13]. In this paper, we focus on reconstruction from RGB images, since cameras are an attractive sensor modality for aerial and ground robots because they are more affordable, lighter and power-efficient than active sensor modalities.…”

Section: Introduction (mentioning)

confidence: 99%


Iterative Online 3D Reconstruction from RGB Images

Cardoen

Leroux

Simoens

2022

Sensors


3D reconstruction is the computer vision task of reconstructing the 3D shape of an object from multiple 2D images. Most existing algorithms for this task are designed for offline settings, producing a single reconstruction from a batch of images taken from diverse viewpoints. Alongside reconstruction accuracy, additional considerations arise when 3D reconstructions are used in real-time processing pipelines for applications such as robot navigation or manipulation. In these cases, an accurate 3D reconstruction is already required while the data gathering is still in progress. In this paper, we demonstrate how existing batch-based reconstruction algorithms lead to suboptimal reconstruction quality when used for online, iterative 3D reconstruction and propose appropriate modifications to the existing Pix2Vox++ architecture. When additional viewpoints become available at a high rate, e.g., from a camera mounted on a drone, selecting the most informative viewpoints is important in order to mitigate long term memory loss and to reduce the computational footprint. We present qualitative and quantitative results on the optimal selection of viewpoints and show that state-of-the-art reconstruction quality is already obtained with elementary selection algorithms.


“…Therefore, it lacks stochastic learning capability in the mapping between the extracted image features and the reconstructed 3D models. Peng, K. et al. [24] used a transformer-based encoder-decoder called TMVNet, which outperforms previous methods for 3D reconstruction. This method uses 2D CNN encoders to extract multiple-viewpoint image features and passes the extracted features to two transformer encoders to generate 3D feature vectors.…”

Section: Related Work (mentioning)

confidence: 99%

A Voxel Generator Based on Autoencoder

2022


In recent years, 3D models have been widely used in the virtual/augmented reality industry. The traditional way of constructing 3D models for real-world objects remains expensive and time-consuming. With the rapid development of graphics processors, many approaches based on deep learning models have been proposed to reduce the time and economic cost of the generation of 3D object models. However, the quality of the generated 3D object models leaves considerable room for improvement. Accordingly, we designed and implemented a voxel generator called VoxGen, based on the autoencoder framework. It consists of an encoder that extracts image features and a decoder that maps feature values to voxel models. The main characteristics of VoxGen are exploiting modified VGG16 and ResNet18 to enhance the effect of feature extraction and mixing the deconvolution layer with the convolution layer in the encoder to enhance the feature of generated voxels. Our experimental results show that VoxGen outperforms related approaches in terms of the volumetric intersection over union (IOU) values of generated voxels.
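The volumetric IoU metric mentioned in this abstract can be sketched as follows. This is a minimal illustration, not VoxGen's evaluation code: the function name `voxel_iou`, the 0.5 occupancy threshold, and the convention that two empty grids score 1.0 are all assumptions.

```python
def voxel_iou(pred, target, threshold=0.5):
    """Volumetric intersection over union of two voxel grids.

    pred:   nested lists [depth][height][width] of occupancy probabilities.
    target: nested lists of the same shape with 0/1 ground-truth occupancy.
    threshold: assumed cutoff for binarizing predictions.
    """
    intersection = 0
    union = 0
    for pred_plane, target_plane in zip(pred, target):
        for pred_row, target_row in zip(pred_plane, target_plane):
            for p, t in zip(pred_row, target_row):
                p_occupied = p > threshold
                t_occupied = t > 0.5
                intersection += p_occupied and t_occupied
                union += p_occupied or t_occupied
    # Assumed convention: two completely empty grids match perfectly.
    return intersection / union if union else 1.0


if __name__ == "__main__":
    predicted = [[[0.9, 0.1], [0.8, 0.2]], [[0.0, 0.0], [0.7, 0.0]]]
    ground_truth = [[[1, 0], [0, 0]], [[0, 0], [1, 1]]]
    print(voxel_iou(predicted, ground_truth))
```

In this toy 2×2×2 example the prediction and ground truth share two occupied voxels out of four occupied in either grid.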

