Transforming input images into corresponding textual descriptions is a challenging task at the intersection of computer vision and natural language processing. In this paper, we propose an ensemble approach built on Contrastive Language-Image Pretraining (CLIP) models. The ensemble combines two variants of CLIP, each tailored to a different aspect of the image-to-text task. The first model extends CLIP with multiple additional layers trained with distinct learning rates, improving its ability to capture fine-grained relationships between images and text. The second model exploits CLIP's zero-shot capability to produce image-text embeddings, which are then used by a K-Nearest Neighbors (KNN) model: a textual output is obtained by retrieving the closest embeddings in the shared embedding space. We evaluate the ensemble using the cosine similarity between model-generated embeddings and ground-truth representations. Comparative experiments show that the ensemble outperforms standalone CLIP models. Beyond advancing the state of the art in image-to-text transformation, this study underscores the promise of ensemble learning for complex multimodal tasks.
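To make the second branch concrete, the following is a minimal sketch of the zero-shot CLIP + KNN retrieval idea described above, assuming a publicly available CLIP checkpoint ("openai/clip-vit-base-patch32") and scikit-learn for the nearest-neighbor search; the checkpoint name, candidate captions, and query image path are illustrative placeholders, not the exact configuration used in the paper.

```python
# Sketch: embed candidate captions with zero-shot CLIP, index them with KNN,
# then map a query image to its nearest caption via cosine similarity.
import torch
from PIL import Image
from sklearn.neighbors import NearestNeighbors
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Candidate captions whose embeddings populate the retrieval index (placeholder data).
captions = ["a dog playing in the park", "a plate of pasta", "a city skyline at night"]

with torch.no_grad():
    text_inputs = processor(text=captions, return_tensors="pt", padding=True)
    text_emb = model.get_text_features(**text_inputs)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)  # L2-normalize for cosine geometry

# KNN index over caption embeddings; cosine distance = 1 - cosine similarity.
knn = NearestNeighbors(n_neighbors=1, metric="cosine").fit(text_emb.numpy())

# Embed a query image and retrieve the closest caption embedding.
image = Image.open("example.jpg")  # placeholder path
with torch.no_grad():
    image_inputs = processor(images=image, return_tensors="pt")
    img_emb = model.get_image_features(**image_inputs)
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)

distances, indices = knn.kneighbors(img_emb.numpy())
cosine_sim = 1.0 - distances[0, 0]  # same cosine-similarity notion used for evaluation
print(captions[indices[0, 0]], cosine_sim)
```

In this sketch the cosine similarity that ranks neighbors is also the quantity reported for evaluation, which is why the embeddings are L2-normalized before indexing.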