Background: Access to oral healthcare is not uniform globally, particularly in rural areas with limited resources, which limits the potential of automated diagnostics and advanced tele-dentistry applications. Digital caries detection and progression monitoring through photographic communication are influenced by multiple variables that are difficult to standardise in such settings. The objective of this study was to develop a novel, cost-effective virtual computer vision AI system that predicts dental cavitations from non-standardised photographs with reasonable clinical accuracy.

Methods: A set of 1703 augmented images was obtained from 233 de-identified tooth specimens. Images were acquired with a consumer smartphone, without any standardised apparatus. The study utilised state-of-the-art ensemble modelling, test-time augmentation, and transfer learning. Derivatives of the “you only look once” (YOLO) algorithm (v5s, v5m, v5l, and v5x) were evaluated independently; an ensemble of the best-performing models was then augmented at test time and transfer-learned with ResNet50, ResNet101, VGG16, AlexNet, and DenseNet. Outcomes were evaluated using precision, recall, accuracy, and mean average precision (mAP).

Results: The YOLO model ensemble achieved a mAP of 0.732, an accuracy of 0.789, and a recall of 0.701. With transfer learning to VGG16, the final model reached a diagnostic accuracy of 86.96%, a precision of 0.89, and a recall of 0.88, surpassing all other base methods for object detection in free-hand, non-standardised smartphone photographs.

Conclusion: A virtual computer vision AI system combining model ensembling, test-time augmentation, and transfer learning was developed to predict dental cavitations from non-standardised photographs with reasonable clinical accuracy. This model can improve access to oral healthcare in rural areas with limited resources and has the potential to support automated diagnostics and advanced tele-dentistry applications.
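
The following is a minimal sketch of the transfer-learning and test-time-augmentation steps named in the Methods, assuming a binary cavitation/no-cavitation label per tooth image; the class count, frozen-layer choice, and augmentation set are illustrative assumptions, not the authors' reported configuration.

```python
# Illustrative sketch only: a VGG16 transfer-learning head with simple
# test-time augmentation, assuming a 2-class caries task (assumption).
import torch
import torch.nn as nn
from torchvision import models, transforms

# Load an ImageNet-pretrained VGG16 and freeze its convolutional backbone,
# so only the classifier head is updated during fine-tuning.
model = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final fully connected layer for the assumed 2-class task
# (cavitation vs. sound tooth surface).
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)

# Test-time augmentation: average softmax outputs over an identity view
# and a horizontally flipped view of the same image.
tta_views = [
    transforms.Compose([]),                                   # identity
    transforms.Compose([transforms.RandomHorizontalFlip(p=1.0)]),
]

def predict_with_tta(model: nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Average class probabilities over each test-time augmentation view.

    `image` is a single 3x224x224 tensor, already normalised for VGG16.
    """
    model.eval()
    with torch.no_grad():
        probs = torch.stack([
            torch.softmax(model(view(image).unsqueeze(0)), dim=1)
            for view in tta_views
        ])
    return probs.mean(dim=0)
```

In this sketch the YOLO detection stage is assumed to have already cropped each tooth region; ensembling across the YOLOv5 variants would then average (or vote over) their detections before the crops reach the classifier.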