The discipline of automatic image captioning represents an integration of two pivotal branches of artificial intelligence, namely computer vision (CV) and natural language processing (NLP). The principal functionality of this technology lies in transmuting the extracted visual features into semantic information of a higher order. The bidirectional long short-term memory (Bi-LSTM) has garnered wide acceptance in executing image captioning tasks. Of late, scholarly attention has been focused on modifying suitable models for innovative and precise subtitle captions, although tuning the parameters of the model does not invariably yield optimal outcomes. Given this, the current research proposes a model that effectively employs the bidirectional LSTM and attention mechanism (Bi-LS-AttM) for image captioning endeavors. This model exploits the contextual comprehension from both anterior and posterior aspects of the input data, synergistically with the attention mechanism, thereby augmenting the precision of visual language interpretation. The distinctiveness of this research is embodied in its incorporation of Bi-LSTM and the attention mechanism to engender sentences that are both structurally innovative and accurately reflective of the image content. To enhance temporal efficiency and accuracy, this study substitutes convolutional neural networks (CNNs) with fast region-based convolutional networks (Fast RCNNs). Additionally, it refines the process of generation and evaluation of common space, thus fostering improved efficiency. Our model was tested for its performance on Flickr30k and MSCOCO datasets (80 object categories). Comparative analyses of performance metrics reveal that our model, leveraging the Bi-LS-AttM, surpasses unidirectional and Bi-LSTM models. When applied to caption generation and image-sentence retrieval tasks, our model manifests time economies of approximately 36.5% and 26.3% vis-a-vis the Bi-LSTM model and the deep Bi-LSTM model, respectively.