Marine research that relies on remote sensing increasingly requires gridded data with both high spatial and high temporal resolution. Such data are often unavailable because of the physical and technical constraints of satellite sensors, which force a trade-off among spatial, temporal, and spectral resolution: increasing spatial resolution typically reduces the area covered per pass, which in turn lowers temporal resolution. This manuscript introduces a remote sensing image fusion algorithm that combines Sentinel-2 (high spatial resolution) and Sentinel-3 (relatively high spectral and temporal resolution) satellite data. The algorithm, referred to as the Cross-Attention Spatio-Temporal Spectral Fusion (CASTSF) model, is based on a cross-attention mechanism and accounts for differences in spectral channels, spatial resolution, and temporal phase among images from different sensors. The proposed method fuses atmospherically corrected ocean remote sensing reflectance products (Level 2 OSR) to yield reflectance data at 10 m spatial resolution with a temporal resolution of 1–2 days. From the fused reflectance, the algorithm then generates chlorophyll-a concentration products with improved spatial and temporal resolution. A comparison against existing chlorophyll-a concentration products indicates that the proposed approach is robust and effective, and suggests its potential for other remote sensing applications.
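
As a rough illustration of the cross-attention idea named above, the following minimal sketch lets fine-resolution pixels (standing in for Sentinel-2) query coarse-resolution spectra (standing in for Sentinel-3). It is not the CASTSF architecture, which this abstract does not specify. The function names, array shapes, band counts, and random projection matrices are all assumptions chosen for illustration; in a trained model the projections would be learned.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exps = np.exp(shifted)
    return exps / exps.sum(axis=axis, keepdims=True)


def cross_attention(fine, coarse, w_q, w_k, w_v):
    """Scaled dot-product cross-attention (illustrative, single head).

    fine   : (n_fine, c_fine)     fine-resolution pixels, few bands
    coarse : (n_coarse, c_coarse) coarse-resolution pixels, many bands
    Queries come from the fine image; keys and values from the coarse one.
    """
    q = fine @ w_q        # (n_fine, d)
    k = coarse @ w_k      # (n_coarse, d)
    v = coarse @ w_v      # (n_coarse, c_out)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights


# Hypothetical sizes: a 60 x 60 patch of 10 m pixels with 4 bands,
# overlapping a 2 x 2 block of 300 m pixels with 21 bands.
rng = np.random.default_rng(0)
fine = rng.random((3600, 4))
coarse = rng.random((4, 21))

d = 16  # assumed embedding width
w_q = rng.normal(size=(4, d))     # random stand-ins for learned weights
w_k = rng.normal(size=(21, d))
w_v = rng.normal(size=(21, 21))

fused, weights = cross_attention(fine, coarse, w_q, w_k, w_v)
print(fused.shape)    # one 21-band vector per fine-resolution pixel
```

Each output row is a weighted mixture of the projected coarse spectra, with weights determined by how closely the fine pixel's query matches each coarse pixel's key. This is one way such a mechanism can carry spectral information onto a finer spatial grid; temporal alignment between acquisitions is not modelled in this sketch.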