In urban scenes, street scene text images often carry rich semantic information, which is a key cue for perceiving and understanding the scene. Recovering high-resolution text images from low-resolution ones is challenging because street scene text images exhibit complex backgrounds, irregular text shapes, blurring, distortion, and deformation. Existing approaches mainly rely on recurrent neural networks to mine text-specific contextual information; they cannot effectively capture long-range correlations in text images, fail to exploit the semantic information of the text, and tend to generalize poorly across languages. To address these problems, we propose a text image super-resolution method that fuses self-attention and scene priors. First, a text recognition network and a semantic segmentation network extract prior features, which are fused with the image's visual features through a prior interpreter, enabling effective use of both textual semantics and visual information. Then, a Transformer extracts sequential information within each text line, exploiting the global receptive field of multi-head attention to model correlations between preceding and following characters and thereby mitigate the performance degradation that occurs on long text. Finally, a gradient profile loss and a text structure-aware loss strengthen the model's ability to recover text contours and handle deformed text. Experimental results show that the recognition accuracy of our method on the three TextZoom test subsets exceeds that of the baseline model TSRN by 4%, and the average PSNR and SSIM reach 21.34 and 78.39, respectively, effectively improving the practical performance of text super-resolution models in real scenes.
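
To make the two core ideas in the abstract concrete, the following is a minimal PyTorch sketch of (a) fusing a text prior with visual features through a prior-interpreter-style module and (b) modeling a text line with multi-head self-attention so every character position can attend to all others. The module names (`PriorInterpreter`, `TextLineTransformer`), feature sizes, and the specific fusion scheme are illustrative assumptions for exposition, not the paper's exact architecture.

```python
# Sketch only: prior-feature fusion + Transformer over text-line columns.
# All dimensions and module structure are assumed for illustration.
import torch
import torch.nn as nn


class PriorInterpreter(nn.Module):
    """Projects per-column text prior features (e.g. recognition probabilities)
    and fuses them with the visual feature map from the SR backbone."""

    def __init__(self, vis_dim=64, prior_dim=37, hidden=64):
        super().__init__()
        self.prior_proj = nn.Linear(prior_dim, hidden)          # embed the text prior
        self.fuse = nn.Conv2d(vis_dim + hidden, vis_dim, kernel_size=1)

    def forward(self, vis_feat, prior_feat):
        # vis_feat:   (B, C, H, W) visual features
        # prior_feat: (B, W, prior_dim) per-column text prior
        b, c, h, w = vis_feat.shape
        p = self.prior_proj(prior_feat)                          # (B, W, hidden)
        p = p.permute(0, 2, 1).unsqueeze(2).expand(-1, -1, h, -1)  # broadcast over height
        return self.fuse(torch.cat([vis_feat, p], dim=1))        # fused (B, C, H, W)


class TextLineTransformer(nn.Module):
    """Treats each feature-map column as a token; multi-head self-attention gives
    every position a global view of the text line, capturing long-range character
    correlations that recurrent models struggle with."""

    def __init__(self, vis_dim=64, height=16, heads=4, layers=2):
        super().__init__()
        d_model = vis_dim * height
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, feat):
        b, c, h, w = feat.shape
        tokens = feat.permute(0, 3, 1, 2).reshape(b, w, c * h)   # (B, W, C*H): one token per column
        out = self.encoder(tokens)                               # global attention along the line
        return out.reshape(b, w, c, h).permute(0, 2, 3, 1)       # back to (B, C, H, W)


if __name__ == "__main__":
    vis = torch.randn(2, 64, 16, 64)     # dummy visual features
    prior = torch.randn(2, 64, 37)       # dummy per-column recognition prior
    fused = PriorInterpreter()(vis, prior)
    refined = TextLineTransformer()(fused)
    print(refined.shape)                 # torch.Size([2, 64, 16, 64])
```

Column-wise tokenization is one simple way to expose the character order of a horizontal text line to self-attention; the actual method may tokenize or position-encode the features differently.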