Background
Polyp detection and localization are essential tasks in colonoscopy. U-shaped convolutional neural networks have achieved remarkable segmentation performance on biomedical images, but their limited ability to model long-range dependencies restricts their effective receptive fields.

Purpose
Our goal was to develop and test a novel polyp segmentation architecture that combines local feature learning with long-range dependency modeling.

Methods
A novel polyp segmentation architecture was developed that integrates a transformer into a multi-scale nested U-Net structure. The proposed network takes advantage of both the CNN and the transformer to extract distinct types of feature information. The transformer layer is embedded between the encoder and decoder of the U-shaped network to learn explicit global context and long-range semantic information. To address the challenge of varying polyp sizes, a multi-scale feature fusion (MSFF) unit was proposed to fuse features at multiple resolutions.

Results
Four public datasets and one in-house dataset were used to train and test the model. An ablation study was also conducted to verify the contribution of each component. On the Kvasir-SEG and CVC-ClinicDB datasets, the proposed model achieved mean Dice scores of 0.942 and 0.950, respectively, which were more accurate than the other methods. To assess generalization, two cross-dataset validations were performed, in which the proposed model achieved the highest mean Dice score. These results demonstrate that the proposed network has powerful learning and generalization capability, significantly improving segmentation accuracy and outperforming state-of-the-art methods.

Conclusions
The proposed model produced more accurate polyp segmentations than current methods on four public datasets and one in-house dataset. Its ability to segment polyps of different sizes demonstrates its potential for clinical application.
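To illustrate the general layout described in the Methods (a CNN encoder, a transformer bottleneck between encoder and decoder, and multi-scale feature fusion), the following is a minimal, hypothetical PyTorch sketch. It is not the authors' implementation; the class and function names (HybridUNetSketch, conv_block), the channel widths, and the simplified two-level encoder and fusion head are all illustrative assumptions.

```python
# Hypothetical sketch (not the authors' code): CNN encoder -> transformer
# bottleneck -> decoder, with a simple multi-scale feature fusion head.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
    )

class HybridUNetSketch(nn.Module):
    def __init__(self, in_ch=3, base=32, heads=4, depth=2):
        super().__init__()
        self.enc1, self.enc2 = conv_block(in_ch, base), conv_block(base, base * 2)
        self.pool = nn.MaxPool2d(2)
        # Transformer bottleneck between encoder and decoder for global context.
        layer = nn.TransformerEncoderLayer(d_model=base * 2, nhead=heads,
                                           dim_feedforward=base * 4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=depth)
        self.dec1 = conv_block(base * 2 + base, base)
        # Fusion head: combine features brought to a common resolution.
        self.fuse = nn.Conv2d(base + base * 2, base, 1)
        self.head = nn.Conv2d(base, 1, 1)

    def forward(self, x):
        e1 = self.enc1(x)                       # full-resolution local features
        e2 = self.enc2(self.pool(e1))           # half-resolution features
        b, c, h, w = e2.shape
        tokens = e2.flatten(2).transpose(1, 2)  # (B, HW, C) tokens for self-attention
        e2 = self.transformer(tokens).transpose(1, 2).reshape(b, c, h, w)
        d1 = F.interpolate(e2, size=e1.shape[2:], mode="bilinear", align_corners=False)
        d1 = self.dec1(torch.cat([d1, e1], dim=1))
        # Multi-scale fusion: upsample the bottleneck and merge it with the decoder output.
        ms = F.interpolate(e2, size=d1.shape[2:], mode="bilinear", align_corners=False)
        fused = self.fuse(torch.cat([d1, ms], dim=1))
        return torch.sigmoid(self.head(fused))  # per-pixel polyp probability

# Usage: HybridUNetSketch()(torch.randn(1, 3, 256, 256)) -> mask of shape (1, 1, 256, 256)
```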