With the development of deep learning, convolutional neural networks (CNNs) and vision transformers (ViTs) have been increasingly applied to polarimetric synthetic aperture radar (PolSAR) image classification. However, the special data form of PolSAR images carries very rich information, which makes it difficult for a single network structure to extract that information comprehensively. In addition, deep learning methods require a large amount of training data, whereas labeled PolSAR data are scarce and difficult to obtain. Therefore, a multi-granularity hybrid CNN-ViT model based on external tokens and cross-attention is proposed for PolSAR image classification. First, because CNNs learn local features well, a CNN-based external feature extractor is designed to extract local features from the PolSAR image. Second, because ViTs can focus on global features, a multi-granularity attention structure is constructed to extract global information at multiple scales. With these two modules, the model can exploit the feature information contained in PolSAR images more fully than a single network structure can. Next, to make further use of these features, a cross-attention feature fusion module is built to fuse global and local information of different granularities. Finally, a softmax classifier outputs the prediction results. Experimental results on three benchmark datasets show that the proposed method, even when trained with a small amount of labeled data, achieves the best classification performance among the compared methods.
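
The fusion of global and local information by cross-attention can be illustrated with a minimal, dependency-free sketch. It is not the proposed model: the abstract does not specify the module's layout, so the sketch assumes that global tokens from the ViT branch act as queries and that external local tokens from the CNN branch act as keys and values. It also omits learned projections and multiple heads, and adds a residual connection as a further assumption. The names `cross_attention` and `fuse` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: n_q vectors of dimension d
             (assumption: global tokens from the ViT branch).
    keys, values: n_k vectors each
             (assumption: external local tokens from the CNN branch).
    Learned Q/K/V projections are omitted (identity) to keep the sketch short.
    Returns n_q vectors, each a weighted average of `values`.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        # Similarity of this query to every key, scaled by 1/sqrt(d).
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors, one output dimension at a time.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def fuse(global_tokens, local_tokens):
    """Assumed residual fusion: each global token plus its attended local context."""
    attended = cross_attention(global_tokens, local_tokens, local_tokens)
    return [
        [g + a for g, a in zip(g_tok, a_tok)]
        for g_tok, a_tok in zip(global_tokens, attended)
    ]


if __name__ == "__main__":
    global_tokens = [[1.0, 0.0], [0.0, 1.0]]             # toy global tokens
    local_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy external tokens
    for token in fuse(global_tokens, local_tokens):
        print(token)
```

In a multi-granularity setting, one plausible arrangement would be to apply such a fusion step once per scale and combine the fused tokens before the softmax classifier; how the proposed model actually combines the granularities is not stated in the abstract.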