CNN-based methods have achieved impressive results in medical image segmentation, but they fail to capture long-range dependencies due to the inherent locality of the convolution operation. Transformer-based methods have recently become popular in vision tasks because of their capacity to model long-range dependencies, and they achieve promising performance. However, transformers are weak at modeling local context. Some works attempt to overcome this by embedding convolutional layers and obtain some improvement, but this makes the features inconsistent and fails to leverage the natural multi-scale features of hierarchical transformers, which limits model performance. In this paper, taking medical image segmentation as an example, we present MISSFormer, an effective and powerful Medical Image Segmentation tranSFormer. MISSFormer is a hierarchical encoder-decoder network with two appealing designs: 1) the feed-forward network is redesigned in the proposed Enhanced Transformer Block, which aligns features adaptively and enhances both long-range dependencies and local context; 2) we propose the Enhanced Transformer Context Bridge, a context bridge built from enhanced transformer blocks that models the long-range dependencies and local context of the multi-scale features generated by our hierarchical transformer encoder. Driven by these two designs, MISSFormer shows a strong capacity to capture valuable dependencies and context in medical image segmentation. Experiments on multi-organ and cardiac segmentation tasks demonstrate the superiority, effectiveness, and robustness of MISSFormer: trained from scratch, it even outperforms state-of-the-art methods pretrained on ImageNet, and its core designs can be generalized to other visual segmentation tasks. The code will be released on GitHub.
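To make the first design concrete, below is a minimal sketch of a feed-forward network that injects local context via a depthwise convolution with a skip connection, in the spirit of the Enhanced Transformer Block described above. This is an illustrative reconstruction, not the authors' released code: the class name, hidden dimensions, and the exact placement of the LayerNorm and skip connection are assumptions.

```python
import torch
import torch.nn as nn

class EnhancedMixFFN(nn.Module):
    """Feed-forward network with a depthwise 3x3 convolution and a skip
    connection around it, so tokens gain local context while staying
    aligned with the residual stream (illustrative sketch only)."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        # Depthwise conv injects local spatial context into the FFN.
        self.dwconv = nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1,
                                groups=hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden_dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence flattened from an H x W feature map.
        x = self.fc1(x)
        B, N, C = x.shape
        feat = x.transpose(1, 2).reshape(B, C, H, W)
        feat = self.dwconv(feat).flatten(2).transpose(1, 2)
        # Skip connection keeps conv features and token features consistent.
        x = self.norm(x + feat)
        return self.fc2(self.act(x))

# Example: 16x16 feature map with 64-dim tokens (shapes are illustrative).
ffn = EnhancedMixFFN(dim=64, hidden_dim=256)
out = ffn(torch.randn(2, 16 * 16, 64), H=16, W=16)  # (2, 256, 64)
```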
Gesture recognition aims at understanding dynamic gestures of the human body and is one of the most important modes of human–computer interaction. To extract more effective spatiotemporal features from gesture videos for more accurate gesture classification, we propose a novel feature extractor network, the spatiotemporal attention 3D DenseNet. We extend DenseNet with 3D kernels, refine the Temporal Transition Layer into a Refined Temporal Transition Layer, and explore attention mechanisms in 3D ConvNets. Embedding the Refined Temporal Transition Layer and the attention mechanism in DenseNet3D yields the proposed network, which we name the spatiotemporal attention 3D DenseNet. Our experiments show that the Refined Temporal Transition Layer performs better than the original Temporal Transition Layer, and that the proposed spatiotemporal attention 3D DenseNet outperforms the current state-of-the-art methods in each modality on the ChaLearn LAP Large-Scale Isolated Gesture dataset. The code and pretrained models are released at https://github.com/dzf19927/STA3D .
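For readers unfamiliar with Temporal Transition Layers, the sketch below shows the general idea: parallel 3D convolutions with different temporal kernel depths whose outputs are concatenated, so the layer sees both short- and long-range motion, plus a simple channel-attention module over 3D features. Both modules are assumptions standing in for the paper's designs: the branch depths, channel splits, and the squeeze-and-excitation-style attention are illustrative choices, not the authors' exact refinement.

```python
import torch
import torch.nn as nn

class TemporalTransitionLayer(nn.Module):
    """Parallel 3D convs with different temporal depths, concatenated
    along channels (T3D-style sketch; depths/splits are assumptions)."""

    def __init__(self, in_ch, out_ch, depths=(1, 3, 5)):
        super().__init__()
        branch_ch = out_ch // len(depths)  # out_ch divisible by len(depths)
        self.branches = nn.ModuleList(
            nn.Sequential(
                nn.Conv3d(in_ch, branch_ch, kernel_size=(d, 1, 1),
                          padding=(d // 2, 0, 0), bias=False),
                nn.BatchNorm3d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for d in depths
        )

    def forward(self, x):
        # x: (B, C, T, H, W) video feature map; odd depths preserve T.
        return torch.cat([branch(x) for branch in self.branches], dim=1)

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation-style attention over 3D features, a
    plausible stand-in for the paper's attention mechanism."""

    def __init__(self, ch, reduction=8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.fc = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(inplace=True),
            nn.Linear(ch // reduction, ch), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c = x.shape[:2]
        # Global pooling -> per-channel gate in [0, 1] -> reweight features.
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1, 1)
        return x * w

# Example: 8-frame clip features (shapes are illustrative).
feats = torch.randn(2, 64, 8, 28, 28)
ttl = TemporalTransitionLayer(64, 96)
attn = ChannelAttention3D(96)
out = attn(ttl(feats))  # (2, 96, 8, 28, 28)
```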