In autofocus systems, rapid and precise determination of the focusing distance is crucial for capturing sharp images, a challenge that demands efficient and effective algorithms. To enhance this process, we introduce a novel autofocus mechanism based on a deep learning approach with an optimized VGG network architecture, designated ST-VGG. This method adapts the VGG16 framework by refining its convolutional layers and optimizing parameters to improve efficiency, notably incorporating global average pooling to reduce the risk of overfitting. Moreover, we integrate the Inception architecture to enable multi-scale feature extraction without additional computational burden, enhancing the model's analytical depth and efficiency. By adopting depthwise separable convolutions, we further reduce model complexity and computational demand, making ST-VGG well suited for mobile and embedded applications. We frame autofocus as a regression problem and, through extensive validation on a comprehensive dataset, demonstrate that our model outperforms existing compact networks (TVGG, TSwinT, TViT) and established architectures (LeNet, AlexNet) in prediction accuracy and consistency, with minimal bias. The ST-VGG model markedly reduces the number of trainable parameters, decreases validation loss by 0.02, and achieves an inference speed of 1.4 ms per image. These advancements not only improve the model's efficacy but also reduce the GPU memory required for training, offering significant potential benefits for applications in the life sciences and microrobotics.
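
To make the described building blocks concrete, the following minimal PyTorch-style sketch shows how depthwise separable convolutions, an Inception-style multi-branch block, global average pooling, and a scalar regression head could be combined for focus-distance prediction. The class names, layer counts, and channel widths (e.g. STVGGSketch, 32/64 channels) are illustrative assumptions and do not represent the authors' actual ST-VGG configuration.

# Illustrative sketch only: the exact ST-VGG layer counts, channel widths, and
# branch structure are not given in the abstract; all values below are assumptions
# used to show how the described components fit together.
import torch
import torch.nn as nn


class DepthwiseSeparableConv(nn.Module):
    """Depthwise 3x3 conv followed by a pointwise 1x1 conv (fewer parameters than a full conv)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))


class InceptionBlock(nn.Module):
    """Parallel 1x1 / 3x3 / 5x5 branches concatenated to capture features at multiple scales."""
    def __init__(self, in_ch, branch_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, branch_ch, kernel_size=1)
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 3, padding=1))
        self.b5 = nn.Sequential(nn.Conv2d(in_ch, branch_ch, 1),
                                nn.Conv2d(branch_ch, branch_ch, 5, padding=2))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)


class STVGGSketch(nn.Module):
    """VGG-style stack of separable convs, one Inception block, global average pooling, and a scalar regression head."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            DepthwiseSeparableConv(3, 32), nn.MaxPool2d(2),
            DepthwiseSeparableConv(32, 64), nn.MaxPool2d(2),
            InceptionBlock(64, 32),       # three 32-channel branches -> 96 channels
            nn.AdaptiveAvgPool2d(1),      # global average pooling instead of flatten + large FC layers
        )
        self.regressor = nn.Linear(96, 1)  # single output: predicted focusing distance

    def forward(self, x):
        x = self.features(x).flatten(1)
        return self.regressor(x)


if __name__ == "__main__":
    model = STVGGSketch()
    focus = model(torch.randn(1, 3, 224, 224))  # predicted focus distance, shape (1, 1)
    print(focus.shape)

Because autofocus is framed as a regression problem, such a model would typically be trained with a regression objective such as mean squared error between the predicted and true focusing distances.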