Tumor morphological features from histology images are a cornerstone of clinical pathology, diagnostic biomarkers, and basic cancer biology research. Spatial transcriptomics, which provides spatially resolved gene expression profiles overlaid on histology images, offers a unique opportunity to integrate morphological and expression features, deepening our understanding of tumor biology. However, spatial transcriptomics experiments on patient samples in clinical trials or clinical care are costly and challenging, whereas histology images are generated routinely and are available for many well-annotated legacy cohorts with prospectively collected disease progression and outcome data. Inferring spatial transcriptomics profiles computationally from these histology images would substantially expand our understanding of tumor biology, but the paired data needed to train multi-modal spatial-histology models remain limited. Here, we tackle this challenge by leveraging performant foundation models pre-trained on massive datasets of pathology images and single-cell RNA-seq, respectively, which provide informative embeddings on which to build multi-modal models. To this end, we developed PathOmCLIP, a model trained with a contrastive loss to create a joint embedding space between a histopathology foundation model and a single-cell RNA-seq foundation model. After contrastive training, we incorporate a set transformer to aggregate localized neighborhood tumor architecture, which further improves performance and is necessary for robust results. We validate PathOmCLIP across five tumor types, achieving significant improvements over other methods on gene expression prediction tasks. PathOmCLIP can be applied to many archived histology images, unlocking valuable clinical information and facilitating new biomarker discoveries.
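To make the two components named above concrete, the sketch below illustrates (i) a CLIP-style symmetric contrastive objective over paired pathology-image and expression embeddings and (ii) set-transformer-style attention pooling over a spot's neighboring patches. This is a minimal illustration under stated assumptions, not the released PathOmCLIP implementation: all module names, embedding dimensions, and hyperparameters here are hypothetical.

```python
# Minimal sketch (PyTorch). Assumes precomputed embeddings from a pathology
# foundation model and a single-cell RNA-seq foundation model; dimensions,
# names, and the symmetric InfoNCE form are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveProjector(nn.Module):
    """Projects both modalities into a shared joint embedding space."""
    def __init__(self, img_dim=1024, expr_dim=512, joint_dim=256):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, joint_dim)
        self.expr_proj = nn.Linear(expr_dim, joint_dim)
        # Learnable temperature, as in CLIP; initialized to ln(1/0.07).
        self.logit_scale = nn.Parameter(torch.tensor(2.659))

    def forward(self, img_emb, expr_emb):
        z_img = F.normalize(self.img_proj(img_emb), dim=-1)
        z_expr = F.normalize(self.expr_proj(expr_emb), dim=-1)
        return z_img, z_expr, self.logit_scale.exp()

def clip_loss(z_img, z_expr, scale):
    """Symmetric InfoNCE: matched (image patch, spot expression) pairs
    lie on the diagonal of the similarity matrix."""
    logits = scale * z_img @ z_expr.t()
    targets = torch.arange(len(z_img), device=z_img.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

class NeighborhoodPool(nn.Module):
    """Set-transformer-style pooling: a learned seed query attends over the
    embeddings of a spot's K neighboring patches (a PMA-like block)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.seed = nn.Parameter(torch.randn(1, 1, dim))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, neighbor_emb):                # (B, K, dim)
        seed = self.seed.expand(neighbor_emb.size(0), -1, -1)
        pooled, _ = self.attn(seed, neighbor_emb, neighbor_emb)
        return pooled.squeeze(1)                    # (B, dim)

# Toy batch: 32 paired spots, each with 8 neighboring patches.
img = torch.randn(32, 1024)   # pathology foundation-model embeddings
expr = torch.randn(32, 512)   # scRNA-seq foundation-model embeddings
model = ContrastiveProjector()
z_i, z_e, s = model(img, expr)
loss = clip_loss(z_i, z_e, s)
loss.backward()
neighborhood = NeighborhoodPool()(torch.randn(32, 8, 256))  # (32, 256)
```

In this reading, the contrastive stage aligns per-spot image and expression embeddings, and the pooling stage then summarizes local tumor architecture from each spot's neighborhood; how the two stages are wired together in the actual model is not specified here and would follow the paper's methods section.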