The rapid development of spatial transcriptomics allows for the measurement of RNA abundance at a high spatial resolution, making it possible to simultaneously profile gene expression, spatial locations, and the corresponding hematoxylin and eosin-stained histology images. Since histology images are relatively easy and cheap to obtain, it is promising to leverage histology images for predicting gene expression. Though several methods have been devised to predict gene expression using histology images, they don’t simultaneously include the 2D vision features and the spatial dependency, limiting their performances. Here, we have developed Hist2ST, a deep learning-based model using histology images to predict RNA-seq expression. At each sequenced spot, the corresponding histology image is cropped into an image patch, from which 2D vision features are learned through convolutional operations. Meanwhile, the spatial relations with the whole image and neighbored patches are captured through Transformer and graph neural network modules, respectively. These learned features are then used to predict the gene expression by following the zero-inflated negative binomial (ZINB) distribution. To alleviate the impact by the small spatial transcriptomics data, a self-distillation mechanism is employed for efficient learning of the model. Hist2ST was tested on the HER2-positive breast cancer and the cutaneous squamous cell carcinoma datasets, and shown to outperform existing methods in terms of both gene expression prediction and following spatial region identification. Further pathway analyses indicated that our model could reserve biological information. Thus, Hist2ST enables generating spatial transcriptomics data from histology images for elucidating molecular signatures of tissues.