Serous cavity effusion is a common pathological condition in clinical practice, and fluid samples from these effusions are important for both diagnosis and treatment. Cytological examination of smears is the conventional method for diagnosing serous cavity effusion and is valued for its convenience. However, the technique has limitations that can compromise its efficiency and diagnostic accuracy. This study aims to address these limitations with an improved method for detecting malignant cells in serous cavity effusions. We developed a transformer-based classification framework built on the vision transformer (ViT) model. We collected smear images and the corresponding cytological reports from 161 patients who underwent serous cavity drainage, and annotated 4836 patches from these images as containing or not containing malignant cells, creating a dataset for smear image classification. Deep learning models, particularly the ViT model, classified patches as malignant or non-malignant with high accuracy. The ViT model achieved an area under the receiver operating characteristic curve (AUROC) of 0.99, compared with 0.86 for the convolutional neural network (CNN) model. We also validated the models on an external cohort of 127 patients. The ViT model maintained its screening performance, achieving a patient-level AUROC of 0.98, compared with 0.84 for the CNN model. Visualization of the ViT models showed that they could identify regions containing malignant cells across multiple serous cavity effusion smear images.
In summary, our study demonstrates the potential of deep learning models, particularly the ViT model, for automating the screening of serous cavity effusions. These models can assist cytologists in improving diagnostic accuracy and efficiency. The self-attention mechanism of the ViT model makes it well suited to tasks that require detailed analysis of small, sparsely distributed targets, such as cell clusters in serous cavity effusions.
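As an illustration of how a patient-level AUROC can be derived from patch-level predictions, the following is a minimal sketch and not the study's actual pipeline. The aggregation rule (taking the maximum patch probability per patient) is an assumption made here for illustration; the function names `patient_scores` and `auroc` and the toy values are hypothetical.

```python
from collections import defaultdict


def patient_scores(patch_probs, patient_ids):
    """Collapse patch-level malignancy probabilities into one score per patient.

    Assumption: a patient's score is the maximum probability over their
    patches, so a single confidently malignant patch flags the patient.
    """
    scores = defaultdict(float)  # probabilities are >= 0, so 0.0 is a safe start
    for pid, prob in zip(patient_ids, patch_probs):
        scores[pid] = max(scores[pid], prob)
    return dict(scores)


def auroc(scores, labels):
    """AUROC as the probability that a random positive case outranks a
    random negative case, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented for illustration): six patches from three patients.
patch_probs = [0.1, 0.9, 0.2, 0.3, 0.8, 0.4]
patient_ids = ["a", "a", "b", "b", "c", "c"]
truth = {"a": 1, "b": 0, "c": 1}  # 1 = malignant per the cytological report

per_patient = patient_scores(patch_probs, patient_ids)
ids = sorted(per_patient)
patient_auroc = auroc([per_patient[i] for i in ids], [truth[i] for i in ids])
```

Other aggregation rules, such as the mean of the top-k patch probabilities, would fit the same interface; the choice affects how sensitive the patient-level score is to isolated false-positive patches.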