Multi-compound Transformer for Accurate Biomedical Image Segmentation

Ji, Yuanfeng; Zhang, Ruimao; Wang, Huijie; Li, Zhen; Wu, Lingyun; Zhang, Shaoting; Luo, Ping

doi:10.1007/978-3-030-87193-2_31

Cited by 92 publications

(41 citation statements)

References 25 publications


“…ResUNet applies residual blocks as UNet building blocks [22]. MC-Trans [13] introduces a cross-attention block between the encoder and decoder to gather cross-scale dependencies of the feature maps. Refined DLA (rDLA) [19] bases its backbone on a leading CNN architecture, Deep Layer Aggregation (DLA) [28], and aggregates context information from cross-view through a refinement stage.…”

Section: Results (mentioning)

confidence: 99%

“…UTNet [8] instead incorporates interleaved transformer blocks and convolution blocks for small medical dataset. MC-Trans [13] employs a Transformer Cross-Attention (TCA) module to collect context information from feature maps of different scales. However, these approaches are designed for single-view and can be sub-optimal for complex segmentation tasks due to the absence of considering semantic dependencies of different scales and views, which are critical for enhancing clinical lesion assessment.…”

Section: Introduction (mentioning)

confidence: 99%
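
The two statements above describe MC-Trans as using cross-attention to gather dependencies across feature maps of different scales. As a rough illustration of that general idea only, and not the authors' implementation, the NumPy sketch below flattens feature maps of several resolutions into one token sequence and lets the positions of one map attend over all of them. The function names, the absence of learned query/key/value projections, and the use of plain scaled dot-product attention are all assumptions made for this example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def flatten_scales(feature_maps):
    # Hypothetical helper: turn [(C, H_i, W_i), ...] into one
    # (sum_i H_i * W_i, C) token matrix. All maps must share C.
    return np.concatenate(
        [f.reshape(f.shape[0], -1).T for f in feature_maps], axis=0
    )

def cross_scale_attention(query_map, feature_maps):
    # Each spatial position of query_map attends over the tokens of
    # every scale. Assumption: queries, keys and values are the raw
    # features (a real model would apply learned linear projections).
    c, h, w = query_map.shape
    q = query_map.reshape(c, -1).T          # (H*W, C)
    kv = flatten_scales(feature_maps)       # (N, C)
    attn = softmax(q @ kv.T / np.sqrt(c))   # (H*W, N), rows sum to 1
    out = attn @ kv                         # (H*W, C)
    return out.T.reshape(c, h, w), attn

# Three scales with 4 channels each; the coarsest map supplies the queries.
rng = np.random.default_rng(0)
maps = [rng.standard_normal((4, s, s)) for s in (8, 4, 2)]
fused, attn = cross_scale_attention(maps[2], maps)
```

Here `fused` has the same shape as the query map, and each row of `attn` is a distribution over the 64 + 16 + 4 tokens drawn from the three scales.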


TransFusion: Multi-view Divergent Fusion for Medical Image Segmentation with Transformers

Liu, Gao, Zhangli et al. 2022

Preprint


Combining information from multi-view images is crucial to improve the performance and robustness of automated methods for disease diagnosis. However, due to the non-alignment characteristics of multi-view images, building correlation and data fusion across views largely remain an open problem. In this study, we present TransFusion, a Transformer-based architecture to merge divergent multi-view imaging information using convolutional layers and powerful attention mechanisms. In particular, the Divergent Fusion Attention (DiFA) module is proposed for rich cross-view context modeling and semantic dependency mining, addressing the critical issue of capturing long-range correlations between unaligned data from different image views. We further propose the Multi-Scale Attention (MSA) to collect global correspondence of multi-scale feature representations. We evaluate TransFusion on the Multi-Disease, Multi-View & Multi-Center Right Ventricular Segmentation in Cardiac MRI (M&Ms-2) challenge cohort. TransFusion demonstrates leading performance against the state-of-the-art methods and opens up new perspectives for multi-view imaging integration towards robust medical image segmentation.


“…Recently, the Transformer-based architecture has shown excellent success (Dosovitskiy et al, 2021). A commonly adopted strategy for image segmentation is to take a hybrid CNN-Transformer-based architecture (Xie et al, 2021; Ji et al, 2021). (Chen et al, 2021) proposed TransUnet structure that embeds Transformer in the encoder to enhance the long-distance dependency in features for 2D image segmentation tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Memory-efficient Segmentation of High-resolution Volumetric MicroCT Images

Wang, Blackie, Miguel-Aliaga et al. 2022

Preprint


In recent years, 3D convolutional neural networks have become the dominant approach for volumetric medical image segmentation. However, compared to their 2D counterparts, 3D networks introduce substantially more training parameters and higher requirement for the GPU memory. This has become a major limiting factor for designing and training 3D networks for high-resolution volumetric images. In this work, we propose a novel memory-efficient network architecture for 3D high-resolution image segmentation. The network incorporates both global and local features via a two-stage U-net-based cascaded framework and at the first stage, a memory-efficient U-net (meU-net) is developed. The features learnt at the two stages are connected via post-concatenation, which further improves the information flow. The proposed segmentation method is evaluated on an ultra high-resolution microCT dataset with typically 250 million voxels per volume. Experiments show that it outperforms state-of-the-art 3D segmentation methods in terms of both segmentation accuracy and memory efficiency.


“…Currently, Dosovitskiy et al [6] applied Transformer architecture from NLP to computer vision as Vision Transformer (ViT), and showed that the sequences of image patches could perform very well with a pure transformer on image classification, while the convolutional networks usually suffer from difficulty in capturing and storing long-distance dependent information due to the limited receptive field. Following the ViT, many other vision transformer variants are proposed [4,17], and some of them have achieved great performance on various medical tasks [3,10,23,24,28] with the strong representation capabilities of transformer.…”

Section: Related Work (mentioning)

confidence: 99%

Toward Clinically Assisted Colorectal Polyp Recognition via Structured Cross-modal Representation Consistency

Ma, Zhu, Zhang et al. 2022

Preprint

Self Cite


The colorectal polyps classification is a critical clinical examination. To improve the classification accuracy, most computer-aided diagnosis algorithms recognize colorectal polyps by adopting Narrow-Band Imaging (NBI). However, the NBI usually suffers from missing utilization in real clinic scenarios since the acquisition of this specific image requires manual switching of the light mode when polyps have been detected by using White-Light (WL) images. To avoid the above situation, we propose a novel method to directly achieve accurate white-light colonoscopy image classification by conducting structured cross-modal representation consistency. In practice, a pair of multi-modal images, i.e. NBI and WL, are fed into a shared Transformer to extract hierarchical feature representations. Then a novel designed Spatial Attention Module (SAM) is adopted to calculate the similarities between class token and patch tokens for a specific modality image. By aligning the class tokens and spatial attention maps of paired NBI and WL images at different levels, the Transformer achieves the ability to keep both global and local representation consistency for the above two modalities. Extensive experimental results illustrate the proposed method outperforms the recent studies with a margin, realizing multi-modal prediction with a single Transformer while greatly improving the classification accuracy when only with WL images.
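
The abstract above says only that the Spatial Attention Module calculates similarities between the class token and the patch tokens. As a loose sketch of what such a computation could look like, and not the paper's actual module, the NumPy example below scores each patch token against the class token and normalizes the scores into a spatial map. The function name, the scaled dot-product similarity, the softmax normalization, and the convention that the class token sits at index 0 are assumptions for illustration.

```python
import numpy as np

def class_to_patch_attention(tokens, grid_hw):
    # tokens: (1 + H*W, D) array; row 0 is assumed to be the class token
    # and the remaining rows are patch tokens in row-major grid order.
    h, w = grid_hw
    cls, patches = tokens[0], tokens[1:]
    if patches.shape[0] != h * w:
        raise ValueError("number of patch tokens must equal H * W")
    # Assumed similarity: scaled dot product between class and patch tokens.
    scores = patches @ cls / np.sqrt(tokens.shape[1])
    # Softmax over patches so the map is non-negative and sums to 1.
    scores = scores - scores.max()
    weights = np.exp(scores)
    weights = weights / weights.sum()
    return weights.reshape(h, w)

# A 2x2 grid of patches with 2-dimensional tokens.
tokens = np.array([
    [1.0, 0.0],    # class token
    [1.0, 0.0],    # patch aligned with the class token
    [0.0, 1.0],    # orthogonal patch
    [-1.0, 0.0],   # opposite patch
    [0.0, 0.0],    # zero patch
])
attention_map = class_to_patch_attention(tokens, (2, 2))
```

In this toy input the patch aligned with the class token receives the largest weight and the opposite patch the smallest. Two such maps, one per modality, could then be compared with any distance of choice to express a consistency objective; that step is not shown here.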
