Identifying and managing osteosarcoma pose significant challenges, especially in resource-constrained developing nations. Advanced diagnostic methods rely on isolating the nuclei of cancer cells for comprehensive analysis. However, two main challenges persist: mitigating the image noise introduced during the capture and transmission of cellular sections, and providing an efficient, accurate, and cost-effective method for cell nucleus segmentation. To address these issues, we introduce the Twin-Self and Cross-Attention Vision Transformer (TSCA-ViT). This AI-based system employs a directed filtering algorithm for noise reduction and a transformer architecture with a twin attention mechanism for effective segmentation. The model also incorporates cross-attention-enabled skip connections to augment spatial information. We evaluated our method on a dataset of 1000 osteosarcoma pathology slide images from the Second People's Hospital of Huaihua, achieving an average precision of 97.7% and surpassing traditional methodologies. Furthermore, TSCA-ViT's smaller parameter count improves computational efficiency, reducing both processing time and equipment costs. These findings underscore the efficacy and efficiency of TSCA-ViT, offering a promising approach to the ongoing challenges of osteosarcoma diagnosis and treatment, particularly in resource-limited settings.
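
The abstract does not specify the directed filtering step in detail. As an illustration only, the sketch below assumes it behaves like a standard self-guided, edge-preserving filter applied to a grayscale slide patch; the function name, window radius, and regularization constant are hypothetical choices, not the paper's settings.

```python
# Illustrative sketch: a self-guided (edge-preserving) filter for noise
# reduction. This is an assumed stand-in for the paper's "directed
# filtering" step, whose exact formulation is not given in the abstract.
import numpy as np
from scipy.ndimage import uniform_filter

def guided_filter(image: np.ndarray, radius: int = 4, eps: float = 1e-2) -> np.ndarray:
    """Smooth a grayscale image in [0, 1] while preserving edges,
    using the image itself as its own guide."""
    size = 2 * radius + 1                     # box-filter window size
    mean_i = uniform_filter(image, size)      # local mean
    mean_ii = uniform_filter(image * image, size)
    var_i = mean_ii - mean_i * mean_i         # local variance

    # Local linear model q = a * I + b: flat regions (low variance) are
    # smoothed strongly, high-variance edges are largely preserved.
    a = var_i / (var_i + eps)
    b = mean_i - a * mean_i

    # Average the coefficients over overlapping windows, then apply.
    mean_a = uniform_filter(a, size)
    mean_b = uniform_filter(b, size)
    return mean_a * image + mean_b

# Example: denoise a synthetic noisy patch.
rng = np.random.default_rng(0)
patch = np.clip(rng.normal(0.5, 0.1, (256, 256)), 0.0, 1.0)
denoised = guided_filter(patch, radius=4, eps=1e-2)
```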
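
Likewise, the exact design of the cross-attention-enabled skip connections is not given in the abstract. A minimal PyTorch sketch, assuming decoder tokens query same-scale encoder tokens so that encoder spatial detail is injected at the skip, could look like the following; the module and parameter names are hypothetical, not taken from the paper.

```python
# Hypothetical sketch of a cross-attention skip connection; the actual
# TSCA-ViT block design is not specified in the abstract.
import torch
import torch.nn as nn

class CrossAttentionSkip(nn.Module):
    """Fuses decoder tokens with same-scale encoder tokens: queries come
    from the decoder, keys/values from the encoder, so the skip connection
    injects encoder spatial detail into the decoding path."""
    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, dec_tokens: torch.Tensor, enc_tokens: torch.Tensor) -> torch.Tensor:
        # Cross-attention: decoder queries attend to encoder keys/values.
        q = self.norm_q(dec_tokens)
        kv = self.norm_kv(enc_tokens)
        fused, _ = self.attn(q, kv, kv)
        x = dec_tokens + fused            # residual fusion of spatial detail
        return x + self.mlp(x)            # position-wise refinement

# Example with 16 x 16 = 256 tokens of width 256 per image:
dec = torch.randn(2, 256, 256)
enc = torch.randn(2, 256, 256)
out = CrossAttentionSkip()(dec, enc)      # shape: (2, 256, 256)
```

Under this reading, routing queries from the decoder keeps the fusion adaptive: each decoder position selects which encoder locations to draw detail from, rather than concatenating feature maps as in a plain U-Net skip.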