As the lakes located in the Qinghai-Tibet Plateau are important carriers of water resources in Asia, dynamic changes to these lakes intuitively reflect the climate and water resource variations of the Qinghai-Tibet Plateau. To address the insufficient performance of the Convolutional Neural Network (CNN) in learning the spatial relationship between long-distance continuous pixels, this study proposes a water recognition model for lakes on the Qinghai-Tibet Plateau based on U-Net and ViTenc-UNet. This method uses Vision Transformer (ViT) to replace the continuous Convolutional Neural Network layer in the encoder of the U-Net model, which can more accurately identify and extract the continuous spatial relationship of lake water bodies. A Convolutional Block Attention Module (CBAM) mechanism was added to the decoder of the model enabling the spatial information and spectral information characteristics of the water bodies to be more completely preserved. The experimental results show that the ViTenc-UNet model can complete the task of lake water recognition on the Qinghai-Tibet Plateau more efficiently, and the Overall Accuracy, Intersection over Union, Recall, Precision, and F1 score of the classification results for lake water bodies reached 99.04%, 98.68%, 99.08%, 98.59%, and 98.75%, which were, respectively, 4.16%, 6.20% 5.34%, 4.80%, and 5.34% higher than the original U-Net model. Compared to FCN, the DeepLabv3+, TransUNet, and Swin-Unet models also have different degrees of advantages. This model innovatively introduces ViT and CBAM into the water extraction task of lakes on the Qinghai-Tibet Plateau, showing excellent water classification performance of these lake bodies. This method has certain classification advantages and will provide an important scientific reference for the accurate real-time monitoring of important water resources on the Qinghai-Tibet Plateau.