Cross-modal retrieval is a challenging and significant task in intelligent understanding. Researchers have tried to capture modal semantic information through weighted attention mechanisms, but these approaches can neither eliminate the negative effects of irrelevant semantic information nor capture fine-grained modal semantic information. To capture multi-modal semantic information more accurately, a bidirectional focused semantic alignment attention network (BFSAAN) is proposed for cross-modal retrieval tasks. The core ideas of BFSAAN are as follows: 1) A bidirectional focused attention mechanism is adopted to share modal semantic information, further eliminating the negative influence of irrelevant semantic information. 2) Strip pooling, a lightweight spatial attention mechanism, is applied to the image and text modalities to capture modal spatial semantic information. 3) Second-order covariance pooling is explored to obtain multi-modal semantic representations, capturing modal channel semantic information and achieving semantic alignment between the image and text modalities. Experiments are conducted on two standard cross-modal retrieval datasets (Flickr30K and MS COCO). The experimental design covers four aspects: performance comparison, ablation analysis, algorithm convergence, and visual analysis. Experimental results show that BFSAAN achieves better cross-modal retrieval performance.
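As a rough illustration of ideas 2) and 3), the following minimal sketch shows the generic form of strip pooling and second-order covariance pooling on toy data. It is not the BFSAAN implementation: the function names `strip_pool_gate` and `covariance_pool` are hypothetical, the feature map is a single channel stored as nested lists, and the 1D convolutions that strip pooling normally applies before fusing the two strip directions are omitted for brevity.

```python
import math


def strip_pool_gate(fmap):
    """Simplified strip pooling on one H x W channel (a list of rows).

    Averages along horizontal and vertical strips, fuses the two
    descriptors by addition, and uses a sigmoid gate to reweight the input.
    Assumption: the learned 1D convolutions are left out of this sketch.
    """
    h, w = len(fmap), len(fmap[0])
    # One average per row (horizontal strips) and one per column (vertical strips).
    row_mean = [sum(row) / w for row in fmap]
    col_mean = [sum(fmap[i][j] for i in range(h)) / h for j in range(w)]
    # Position (i, j) is gated by sigmoid(row_mean[i] + col_mean[j]).
    return [
        [fmap[i][j] / (1.0 + math.exp(-(row_mean[i] + col_mean[j])))
         for j in range(w)]
        for i in range(h)
    ]


def covariance_pool(features):
    """Second-order pooling: N x C local features -> C x C covariance matrix."""
    n, c = len(features), len(features[0])
    # Channel-wise mean over the N local features (regions or words).
    mean = [sum(f[k] for f in features) / n for k in range(c)]
    centered = [[f[k] - mean[k] for k in range(c)] for f in features]
    # Entry (a, b) measures how channels a and b co-vary across positions.
    return [
        [sum(row[a] * row[b] for row in centered) / n for b in range(c)]
        for a in range(c)
    ]


if __name__ == "__main__":
    print(strip_pool_gate([[1.0, -1.0], [-1.0, 1.0]]))
    print(covariance_pool([[1.0, 2.0], [3.0, 6.0]]))
```

In this sketch the covariance matrix captures pairwise channel statistics, which is the sense in which second-order pooling carries channel information, while the strip averages summarize long, narrow spatial contexts at low cost.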