2021
DOI: 10.1007/978-3-030-87193-2_31

Multi-compound Transformer for Accurate Biomedical Image Segmentation

Abstract: The recent vision transformer (i.e. for image classification) learns nonlocal attentive interaction of different patch tokens. However, prior arts miss learning the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency of the feature representations and semantic embeddings, which are critical for biomedical segmentation. In this paper, we tackle the above issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), …
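The "cross-scale dependencies" idea in the abstract can be pictured with a minimal sketch. The PyTorch module below is a hypothetical illustration under assumed names (`MultiScaleSelfAttention`, `embed_dim`), not the authors' released code: feature maps from several encoder scales are flattened into one token sequence, and standard self-attention runs over the concatenation so a pixel token at one scale can attend to pixels at every other scale.

```python
# Hedged sketch: self-attention over concatenated multi-scale tokens.
# All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn

class MultiScaleSelfAttention(nn.Module):
    """Flatten feature maps from several encoder stages into one token
    sequence and apply multi-head self-attention across all of them."""
    def __init__(self, channels, embed_dim=256, num_heads=8):
        super().__init__()
        # Project each scale's channel count to a shared embedding width.
        self.proj = nn.ModuleList(nn.Conv2d(c, embed_dim, 1) for c in channels)
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, feats):  # feats: list of (B, C_i, H_i, W_i) tensors
        tokens, shapes = [], []
        for f, p in zip(feats, self.proj):
            f = p(f)                                      # (B, D, H_i, W_i)
            shapes.append(f.shape[-2:])
            tokens.append(f.flatten(2).transpose(1, 2))   # (B, H_i*W_i, D)
        x = torch.cat(tokens, dim=1)        # one sequence spanning all scales
        q = self.norm(x)
        x = x + self.attn(q, q, q)[0]       # residual self-attention update
        # Split the sequence back into per-scale feature maps.
        out, i = [], 0
        for h, w in shapes:
            out.append(x[:, i:i + h * w].transpose(1, 2).reshape(-1, x.shape[-1], h, w))
            i += h * w
        return out
```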

Cited by 92 publications (41 citation statements) | References 25 publications
“…ResUNet applies residual blocks as UNet building blocks [22]. MCTrans [13] introduces a cross-attention block between the encoder and decoder to gather cross-scale dependencies of the feature maps. Refined DLA (rDLA) [19] builds its backbone on a leading CNN architecture, Deep Layer Aggregation (DLA) [28], and aggregates cross-view context information through a refinement stage.…”
Section: Results
confidence: 99%
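The encoder-to-decoder cross-attention this statement describes can be sketched along the following lines. This is an assumption-laden illustration, not MCTrans's actual implementation: decoder tokens act as queries against keys and values flattened from multi-scale encoder maps.

```python
# Hedged sketch of an encoder-to-decoder cross-attention block;
# module and argument names are assumptions, not the paper's code.
import torch
import torch.nn as nn

class EncoderDecoderCrossAttention(nn.Module):
    """Decoder tokens (queries) attend to tokens flattened from
    multi-scale encoder feature maps (keys/values)."""
    def __init__(self, dim=256, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, dec_tokens, enc_feats):
        # enc_feats: list of (B, dim, H_i, W_i) encoder maps at several scales
        kv = torch.cat([f.flatten(2).transpose(1, 2) for f in enc_feats], dim=1)
        out, _ = self.attn(self.norm_q(dec_tokens), self.norm_kv(kv), self.norm_kv(kv))
        return dec_tokens + out  # residual connection around cross-attention
```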
“…UTNet [8] instead interleaves transformer blocks and convolution blocks to suit small medical datasets. MCTrans [13] employs a Transformer Cross-Attention (TCA) module to collect context information from feature maps of different scales. However, these approaches are designed for single-view input and can be sub-optimal for complex segmentation tasks because they neglect the semantic dependencies across different scales and views, which are critical for enhancing clinical lesion assessment.…”
Section: Introduction
confidence: 99%
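The interleaving of convolution and transformer blocks described above might look like the following hypothetical stage (not UTNet's code): a convolution handles local detail, then a transformer encoder layer adds global context on the same feature map.

```python
# Hedged sketch of an interleaved conv + transformer stage; names assumed.
import torch.nn as nn

class ConvTransformerStage(nn.Module):
    """One convolution block for local features followed by one
    transformer encoder layer for global context."""
    def __init__(self, dim=128, num_heads=4):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(dim, dim, 3, padding=1), nn.BatchNorm2d(dim), nn.ReLU(inplace=True)
        )
        self.transformer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True
        )

    def forward(self, x):                        # x: (B, dim, H, W)
        x = self.conv(x)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)         # (B, H*W, dim) token sequence
        t = self.transformer(t)
        return t.transpose(1, 2).reshape(b, c, h, w)
```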
“…Recently, Transformer-based architectures have shown excellent success (Dosovitskiy et al, 2021). A commonly adopted strategy for image segmentation is to use a hybrid CNN-Transformer architecture (Xie et al, 2021; Ji et al, 2021). Chen et al (2021) proposed the TransUNet structure, which embeds a Transformer in the encoder to enhance long-distance dependencies in features for 2D image segmentation tasks.…”
Section: Related Work
confidence: 99%
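A hybrid CNN-Transformer encoder of the kind this passage describes can be sketched as follows. The ResNet-50 backbone, dimensions, and names here are assumptions for illustration, not TransUNet's exact configuration: a CNN extracts a feature map, which is flattened into tokens and refined by transformer layers before being handed to a CNN decoder.

```python
# Hedged sketch of a hybrid CNN-Transformer encoder; all choices assumed.
import torch.nn as nn
import torchvision.models as models

class HybridEncoder(nn.Module):
    def __init__(self, dim=768, depth=4, num_heads=8):
        super().__init__()
        resnet = models.resnet50(weights=None)
        # Drop layer4, avgpool, and fc; keep features through layer3
        # (stride 16, 1024 channels).
        self.cnn = nn.Sequential(*list(resnet.children())[:-3])
        self.proj = nn.Conv2d(1024, dim, 1)     # 1x1 conv as token embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                        # x: (B, 3, H, W)
        f = self.proj(self.cnn(x))               # (B, dim, H/16, W/16)
        b, c, h, w = f.shape
        t = self.transformer(f.flatten(2).transpose(1, 2))
        return t.transpose(1, 2).reshape(b, c, h, w)  # map form, for a decoder
```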
“…Recently, Dosovitskiy et al [6] applied the Transformer architecture from NLP to computer vision as the Vision Transformer (ViT), showing that sequences of image patches can perform very well with a pure transformer on image classification, whereas convolutional networks often have difficulty capturing long-distance dependencies because of their limited receptive field. Following ViT, many other vision transformer variants have been proposed [4,17], and some have achieved great performance on various medical tasks [3,10,23,24,28] thanks to the transformer's strong representation capabilities.…”
Section: Related Work
confidence: 99%
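The ViT recipe summarized in this passage (cut the image into fixed-size patches, linearly project each patch to a token, run a pure transformer over the sequence) condenses into a small sketch; the sizes and names below are illustrative, not the actual ViT configuration.

```python
# Hedged sketch of the ViT idea; configuration values are assumptions.
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=384, depth=6, heads=6, classes=10):
        super().__init__()
        n = (img_size // patch) ** 2
        # Strided convolution = non-overlapping patch split + linear projection.
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))       # class token
        self.pos = nn.Parameter(torch.zeros(1, n + 1, dim))   # positional embedding
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, classes)

    def forward(self, x):                                     # x: (B, 3, H, W)
        t = self.embed(x).flatten(2).transpose(1, 2)          # (B, N, dim)
        t = torch.cat([self.cls.expand(len(x), -1, -1), t], 1) + self.pos
        return self.head(self.encoder(t)[:, 0])  # classify from the class token
```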