Vision Transformer (ViT) has become the most popular architecture for vision tasks, but it is difficult to apply in the industrial domain because of the heavy computational cost of its self-attention mechanism. Masked Autoencoder (MAE) has recently led the trend in self-supervised learning with a simple, scalable, and efficient ViT-based asymmetric encoder-decoder architecture. To mitigate the quadratic complexity of self-attention, we design a hybrid convolution-transformer pyramid network that combines the respective advantages of convolution and self-attention. However, it is unclear how such a network can be adopted in MAE pre-training: its local convolution operations make it difficult to process a random sequence that contains only a subset of the visual tokens. In this paper, we present a novel and efficient masked image modeling (MIM) approach, the convolutional-contextual transformer masked autoencoder (CoTMAE). The pipeline of CoTMAE consists of: (i) a window masking (WM) strategy that ensures computational efficiency, (ii) an encoder that takes only the visible patches as input to our hybrid convolution-transformer network, (iii) a multi-scale fusion module that enhances the output features of the encoder, which allows the decoder to focus on the reconstruction task, (iv) a feature alignment module that handles the distribution of encoded features and masked patches, and (v) a decoder that reconstructs the missing pixels of the masked patches. Specifically, WM divides the original image into equal-sized windows and applies a random masking strategy within each window. The visible patches alone are then reordered and reassembled into an image that serves as input to the hybrid convolution-transformer pyramid network. WM significantly improves the training efficiency of hybrid convolution-transformer networks and reduces GPU memory usage, while remaining competitive with supervised models on downstream tasks.
We demonstrate that CoTMAE enables self-supervised pre-training of a hybrid convolution-transformer pyramid network and achieves good fine-tuning performance on instance segmentation datasets. The CoTMAE encoder is pre-trained on the ImageNet-1K classification dataset and fine-tuned on the COCO 2017 dataset, where it achieves 52.9% box AP and 45.8% mask AP. On industrial instance segmentation datasets, CoTMAE shows better fine-tuning performance than supervised models.
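
The window masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `window_mask`, its parameters, and the choice to keep a square number of patches per window (so that the visible patches pack into a smaller rectangular image) are assumptions made here for illustration only.

```python
import numpy as np


def window_mask(image, patch=16, window=2, keep_side=1, rng=None):
    """Illustrative window masking (hypothetical helper, not the paper's code).

    The image is split into windows of `window` x `window` patches. Inside each
    window, keep_side**2 patches are kept at random and the rest are masked.
    The kept patches are packed, in raster order, into a smaller image.

    Returns (compact_image, mask), where mask is a boolean patch grid with
    True marking a masked patch.
    """
    rng = np.random.default_rng() if rng is None else rng
    height, width, channels = image.shape
    span = patch * window  # side length of one window in pixels
    if height % span or width % span:
        raise ValueError("image size must be a multiple of patch * window")
    if not 0 < keep_side <= window:
        raise ValueError("keep_side must be in 1..window")

    nh, nw = height // span, width // span  # number of windows per axis
    per_window = window * window
    keep = keep_side * keep_side

    # Split into (window row, window col, patch index in window, py, px, C).
    x = image.reshape(nh, window, patch, nw, window, patch, channels)
    x = x.transpose(0, 3, 1, 4, 2, 5, 6)
    x = x.reshape(nh, nw, per_window, patch, patch, channels)

    # Independent random choice per window; sort to preserve raster order.
    order = np.argsort(rng.random((nh, nw, per_window)), axis=-1)
    kept = np.sort(order[..., :keep], axis=-1)

    # Gather the visible patches of every window.
    visible = np.take_along_axis(x, kept[..., None, None, None], axis=2)

    # Pack the visible patches of each window into a keep_side x keep_side tile.
    visible = visible.reshape(nh, nw, keep_side, keep_side, patch, patch, channels)
    visible = visible.transpose(0, 2, 4, 1, 3, 5, 6)
    compact = visible.reshape(nh * keep_side * patch, nw * keep_side * patch, channels)

    # Boolean mask on the full patch grid (True = masked).
    mask = np.ones((nh, nw, per_window), dtype=bool)
    np.put_along_axis(mask, kept, False, axis=-1)
    mask = mask.reshape(nh, nw, window, window).transpose(0, 2, 1, 3)
    mask = mask.reshape(nh * window, nw * window)
    return compact, mask


# Example: 224 x 224 image, 16-pixel patches, 2 x 2 windows, 1 patch kept per
# window (an assumed 75% masking ratio; the abstract does not state the ratio).
image = np.random.default_rng(0).random((224, 224, 3)).astype(np.float32)
compact, mask = window_mask(image, rng=np.random.default_rng(1))
print(compact.shape, mask.shape)  # (112, 112, 3) (14, 14)
```

Because every window contributes the same number of visible patches, the packed result is a regular image grid, which is what lets convolutional layers operate on it directly. How the visible patches are actually reordered in CoTMAE may differ from this packing.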