“…Limited by the high resource consumption of multi-head attention, most previous works on 3D transformers restrict themselves to resource-saving feature processing, e.g., one-off straightforward feature mapping without any downsampling or upsampling (Wang et al, 2021a), where the size of the feature volumes remains unchanged, or top-down tasks with only downsampling (Mao et al, 2021), where the size of the feature volumes is reduced gradually. In 3D reconstruction, however, a top-down-bottom-up structure is more suitable for feature extraction and prediction generation, as in most 3D-CNN-based structures (Murez et al, 2020; Sun et al, 2021; Stier et al, 2021). In this work, we therefore design the first 3D-transformer-based top-down-bottom-up structure, as shown in Figure 3.…”
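The contrast between the three designs can be made concrete by tracking how the feature-volume resolution evolves through the stages. The sketch below is purely illustrative (the stage count, base resolution, and stride-2 halving are assumptions for exposition, not the authors' actual configuration): one-off mapping keeps the resolution fixed, a top-down structure only shrinks it, and a top-down-bottom-up structure shrinks it and then symmetrically restores it, as in 3D-CNN encoder-decoder backbones.

```python
# Illustrative sketch (assumed stride-2 stages) of feature-volume
# resolutions under the three structures discussed above.

def one_off_mapping(size, n_stages):
    # One-off straightforward mapping: resolution unchanged at every stage.
    return [size] * (n_stages + 1)

def top_down(size, n_stages):
    # Top-down only: each stage halves the volume resolution.
    sizes = [size]
    for _ in range(n_stages):
        sizes.append(sizes[-1] // 2)
    return sizes

def top_down_bottom_up(size, n_stages):
    # Encoder halves the resolution; the decoder mirrors it back up,
    # as in typical 3D-CNN reconstruction backbones.
    down = top_down(size, n_stages)
    up = down[-2::-1]  # decoder resolutions, mirroring the encoder
    return down + up

print(one_off_mapping(64, 3))     # [64, 64, 64, 64]
print(top_down(64, 3))            # [64, 32, 16, 8]
print(top_down_bottom_up(64, 3))  # [64, 32, 16, 8, 16, 32, 64]
```

The memory pressure motivating the design follows directly from these lists: attention cost grows with the number of tokens, so operating multi-head attention at every resolution of the full top-down-bottom-up path is far more demanding than the fixed- or shrinking-resolution alternatives.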