Image captioning remains a challenging problem at the intersection of computer vision and natural language processing. The limited availability of captioning systems for non-English languages poses a significant barrier for speakers who are not proficient in English. In particular, Tamil and Telugu, two of the most widely spoken languages in India, lack image captioning models capable of producing accurate captions. Moreover, generating captions in Tamil and Telugu is complicated by their agglutinative morphology and other linguistic intricacies. Addressing these challenges requires models capable of capturing long-range dependencies and generating contextually meaningful image captions. This research presents a multimodal deep learning framework that integrates the InceptionV3, VGG16, and ResNet50 convolutional neural network architectures with a multihead attention-based transformer architecture. By harnessing the multihead attention mechanism, the model comprehends image context, handles linguistic complexity, and establishes the multimodal associations between visual and textual features. Extensive experiments were carried out on translated versions of the benchmark datasets Flickr8k, Flickr30k, and MSCOCO to evaluate the efficacy of the model. The proposed multimodal approach achieved strong results, particularly on the BLEU metrics: the model reaches maximum BLEU-1 scores of 65.16 and 66.79 on the Tamil and Telugu caption generation tasks, respectively. These results outperform existing methods, indicating improved accuracy in generating captions for both languages. Furthermore, a careful manual audit of the generated captions confirmed their appropriateness, affirming the robustness of the proposed methodology.
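
To make the described architecture concrete, the following is a minimal sketch of how a pretrained CNN backbone (here InceptionV3, one of the three mentioned) can be fused with a multihead attention-based transformer decoder for caption generation. This is an illustrative assumption of the pipeline, not the authors' released code; hyperparameters such as `vocab_size`, `max_len`, and `d_model` are placeholders, and positional encoding and training data handling are omitted for brevity.

```python
# Illustrative sketch only: CNN image features fused with a multihead-attention
# transformer decoder for Tamil/Telugu caption generation. Hyperparameters are
# assumed values, not those reported in the paper.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, max_len, d_model, num_heads = 20000, 40, 512, 8

# CNN encoder: frozen, pretrained InceptionV3 used as a feature extractor;
# its 2048-d pooled output is projected to the transformer width d_model.
cnn = tf.keras.applications.InceptionV3(include_top=False, pooling='avg',
                                        weights='imagenet')
cnn.trainable = False

image_in = layers.Input(shape=(299, 299, 3), name='image')
img_feat = layers.Dense(d_model, activation='relu')(cnn(image_in))
img_feat = layers.Reshape((1, d_model))(img_feat)  # image features as decoder "memory"

# Text decoder: embedded caption tokens first attend to themselves (causally),
# then cross-attend to the image features via multihead attention.
tokens_in = layers.Input(shape=(max_len,), dtype='int32', name='tokens')
x = layers.Embedding(vocab_size, d_model)(tokens_in)  # positional encoding omitted
x = layers.MultiHeadAttention(num_heads, d_model // num_heads)(x, x,
                                                               use_causal_mask=True)
x = layers.LayerNormalization()(x)
x = layers.MultiHeadAttention(num_heads, d_model // num_heads)(x, img_feat)
x = layers.LayerNormalization()(x)
x = layers.Dense(d_model, activation='relu')(x)
outputs = layers.Dense(vocab_size, activation='softmax')(x)  # next-token distribution

model = tf.keras.Model([image_in, tokens_in], outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
model.summary()
```

In this sketch the second `MultiHeadAttention` layer is the multimodal bridge: caption-token queries attend to the projected image features, which is one straightforward way to realize the visual-textual association the abstract describes. Swapping the backbone for VGG16 or ResNet50 only changes the feature extractor and its input size.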