Spatial transcriptomics (ST) technology detects cellular transcriptome information while preserving the spatial location of cells. This capability enables researchers to better understand cellular heterogeneity, spatial organization, and functional interactions in complex biological systems. However, current ST methods are limited by low resolution, which reduces the accuracy of measured gene expression levels. Here, we propose scstGCN, a multimodal information fusion method based on a Vision Transformer and a Graph Convolutional Network that integrates histological images, spot-based ST data, and spatial location information to infer super-resolution gene expression profiles at the single-cell level. We evaluated the accuracy of the super-resolution gene expression profiles that scstGCN generates on ST datasets from diverse diseased and healthy tissues, along with their performance in identifying spatial patterns, functional enrichment analysis, and tissue annotation. The results show that scstGCN predicts super-resolution gene expression accurately and helps researchers discover biologically meaningful differentially expressed genes and pathways. Additionally, scstGCN can segment and annotate tissues at a finer granularity, with results that are strongly consistent with coarse manual annotations. Our source code and all datasets used are available at https://github.com/wenwenmin/scstGCN and https://zenodo.org/records/12800375.