Automatic image caption generation using deep learning is a promising approach for video summarization applications. This methodology leverages neural networks to autonomously produce detailed textual descriptions of significant frames or moments in a video. By examining visual elements, deep learning models can recognize and classify objects, scenes, and actions, enabling the generation of coherent and useful captions. This paper presents a novel methodology for generating image captions in the context of video summarization applications. The DenseNet201 architecture is used to extract image features, enabling the effective capture of comprehensive visual information from video keyframes. For text processing, GloVe embeddings, pre-trained word vectors that capture semantic associations between words, are employed to represent textual information efficiently. These embeddings provide a foundation for understanding the contextual nuances and semantic significance of the words in the captions. An LSTM model then processes the GloVe embeddings, producing captions that maintain coherence, context, and readability. The integration of GloVe embeddings with LSTM models enables an effective fusion of visual and textual information, yielding captions that are both informative and contextually relevant for video summarization. The proposed model significantly enhances performance by combining the strengths of convolutional neural networks for image analysis with those of recurrent neural networks for natural language generation. Experimental results demonstrate the effectiveness of the proposed approach in generating informative captions for video summarization, offering a valuable tool for content understanding, retrieval, and recommendation.
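
To make the described architecture concrete, the following is a minimal sketch, assuming a TensorFlow/Keras implementation, of how DenseNet201 keyframe features, a GloVe-initialised embedding layer, and an LSTM decoder can be wired together. The vocabulary size, caption length, embedding dimension, and layer widths are illustrative assumptions rather than values taken from the paper.

    # Minimal sketch (not the authors' exact implementation) of the described pipeline:
    # DenseNet201 keyframe features, a GloVe-initialised embedding, and an LSTM decoder.
    # Hyperparameters below are assumptions for illustration only.
    import numpy as np
    import tensorflow as tf
    from tensorflow.keras import layers, Model
    from tensorflow.keras.applications import DenseNet201

    VOCAB_SIZE = 10000       # assumed vocabulary size
    MAX_CAPTION_LEN = 30     # assumed maximum caption length
    EMBED_DIM = 200          # assumed GloVe vector dimension

    # Visual encoder: DenseNet201 as a frozen feature extractor; pooled keyframe
    # features (1920-d) would typically be precomputed with cnn.predict(...).
    cnn = DenseNet201(weights="imagenet", include_top=False, pooling="avg")
    cnn.trainable = False

    image_input = layers.Input(shape=(1920,), name="keyframe_features")
    img_dense = layers.Dense(256, activation="relu")(image_input)

    # Text encoder: embedding layer initialised with GloVe vectors, followed by an LSTM.
    glove_matrix = np.zeros((VOCAB_SIZE, EMBED_DIM), dtype="float32")  # placeholder; load real GloVe vectors here
    caption_input = layers.Input(shape=(MAX_CAPTION_LEN,), name="caption_tokens")
    embedding = layers.Embedding(
        VOCAB_SIZE, EMBED_DIM,
        embeddings_initializer=tf.keras.initializers.Constant(glove_matrix),
        trainable=False, mask_zero=True)(caption_input)
    lstm_out = layers.LSTM(256)(embedding)

    # Fuse the visual and textual representations and predict the next word.
    merged = layers.add([img_dense, lstm_out])
    hidden = layers.Dense(256, activation="relu")(merged)
    next_word = layers.Dense(VOCAB_SIZE, activation="softmax")(hidden)

    caption_model = Model(inputs=[image_input, caption_input], outputs=next_word)
    caption_model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
    caption_model.summary()

In this merge-style design, the pooled DenseNet201 features and the LSTM's encoding of the partial caption are combined before the softmax over the vocabulary, so training amounts to predicting each next word of a caption given the keyframe features and the words generated so far; this is one standard way to realise the fusion of visual and textual information described above.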