Spatial transcriptomics (ST) technologies have transformed our ability to study tissue architecture by capturing gene expression profiles together with their spatial context. However, high-dimensional ST data often have limited spatial resolution and exhibit considerable noise and sparsity, which makes subtle spatial patterns difficult to decipher. To address these challenges, we introduce DeepFuseNMF, a multi-modal dimensionality reduction framework that enhances spatial resolution by integrating low-resolution ST data with high-resolution histology images. DeepFuseNMF incorporates nonnegative matrix factorization into a neural network architecture to learn interpretable high-resolution embeddings and to detect spatial domains. Furthermore, DeepFuseNMF handles multiple samples simultaneously and is compatible with various types of histology images. Extensive evaluations on synthetic datasets and on real ST datasets from various technologies and tissue types demonstrate DeepFuseNMF’s ability to produce highly interpretable high-resolution embeddings and to detect refined spatial structures. DeepFuseNMF is a multi-modal data integration approach that leverages the rich information in ST data and the associated high-resolution image data, paving the way for understanding the organization and function of complex tissue structures.
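For readers unfamiliar with the factorization that DeepFuseNMF builds on, the following is a minimal sketch of plain nonnegative matrix factorization using the classical multiplicative updates for the Frobenius-norm objective. It is not the DeepFuseNMF implementation: it uses no neural network and no histology image, and the function name `nmf_multiplicative`, its parameters, and the toy matrix dimensions are all hypothetical choices made for illustration. It only shows how a nonnegative spot-by-gene matrix X can be approximated as X ≈ WH, where the rows of W act as low-dimensional embeddings and the rows of H act as gene loadings.

```python
import numpy as np

def nmf_multiplicative(X, rank, n_iter=200, eps=1e-10, seed=0):
    """Approximate a nonnegative matrix X (spots x genes) as W @ H.

    Hypothetical helper for illustration only; it applies the classical
    multiplicative update rules, not the DeepFuseNMF training procedure.
    """
    rng = np.random.default_rng(seed)
    n_spots, n_genes = X.shape
    # Random nonnegative initialization of embeddings (W) and loadings (H).
    W = rng.random((n_spots, rank)) + eps
    H = rng.random((rank, n_genes)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a nonnegative ratio, so W and H stay nonnegative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy data (assumed sizes): 50 "spots", 20 "genes", 3 latent factors.
rng = np.random.default_rng(1)
X = rng.random((50, 3)) @ rng.random((3, 20))

W, H = nmf_multiplicative(X, rank=3)
rel_err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Because both factors are constrained to be nonnegative, each factor contributes additively to the reconstruction, which is the usual reason given for the interpretability of NMF embeddings.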