BackgroundDynamic contrast enhanced magnetic resonance imaging (DCE‐MRI) plays a crucial role in the diagnosis and measurement of hepatocellular carcinoma (HCC). The multi‐modality information contained in the multi‐phase images of DCE‐MRI is important for improving segmentation. However, this remains a challenging task due to the heterogeneity of HCC, which may cause one HCC lesion to have varied imaging appearance in each phase of DCE‐MRI. In particular, some phases exhibit inconsistent sizes and boundaries will result in a lack of correlation between modalities, and it may pose inaccurate segmentation results.PurposeWe aim to design a multi‐modality segmentation model that can learn meaningful inter‐phase correlation for achieving HCC segmentation.MethodsIn this study, we propose a two‐stage progressive attention segmentation framework (TPA) for HCC based on the transformer and the decision‐making process of radiologists. Specifically, the first stage aims to fuse features from multi‐phase images to identify HCC and provide localization region. In the second stage, a multi‐modality attention transformer module (MAT) is designed to focus on the features that can represent the actual size.ResultsWe conduct training, validation, and test in a single‐center dataset (386 cases), followed by external test on a batch of multi‐center datasets (83 cases). Furthermore, we analyze a subgroup of data with weak inter‐phase correlation in the test set. The proposed model achieves Dice coefficient of 0.822 and 0.772 in the internal and external test sets, respectively, and 0.829, 0.791 in the subgroup. The experimental results demonstrate that our model outperforms state‐of‐the‐art models, particularly within subgroup.ConclusionsThe proposed TPA provides best segmentation results, and utilizing clinical prior knowledge for network design is practical and feasible.