Automated assessment of tomato crop maturity is vital for improving agricultural productivity and reducing food waste. Traditionally, farmers have relied on visual inspection and manual assessment to predict tomato maturity, which is prone to human error and time-consuming. Computer vision and deep learning automate this process by analysing visual characteristics, enabling data-driven harvest decisions, optimising quality, and reducing waste for sustainable and efficient agriculture. This research demonstrates deep learning models accurately classifying tomato maturity stages using computer vision techniques, utilising a novel dataset of 4,353 tomato images. The Vision Transformer (ViT) model exhibited superior performance in classifying tomatoes into three ripeness categories (immature, mature, and partially mature), achieving a remarkable testing accuracy of 98.67% and the Convolution neural network (CNN) models, including EfficientNetB1, EfficientNetB5, EfficientNetB7, InceptionV3, ResNet50, and VGG16, achieved testing accuracies of 88.52%, 89.84%, 91.16%, 90.94%, 93.15%, and 92.27%, respectively, when tested with unseen data. ViT significantly surpassed the performance of CNN models. This research highlights the potential for deploying ViT in agricultural environments to monitor tomato maturity stages and packaging facilities smartly. Transformer-based systems could substantially reduce food waste and improve producer profits and productivity by optimising fruit harvest time and sorting decisions.