Under China's strict carbon emission quota policy, the urban carbon price directly affects both the operation of enterprises and forest carbon sequestration. As a result, accurate carbon price forecasting has become a popular research topic in forest science. Like stock prices, urban carbon prices are difficult to forecast with simple models that rely only on historical prices. Urban remote sensing images, which contain rich information about human economic activity, reflect the changing trend of carbon prices. However, how to properly integrate remote sensing data into carbon price forecasting has not yet been investigated. In this study, we build on the transformer paradigm and propose a novel carbon price forecasting method, called MFTSformer, which extracts information from urban remote sensing and historical price data through an encoder–decoder framework. A self-attention mechanism is used to capture the intrinsic characteristics of long-term price data. We conduct comparison experiments with four baselines, ablation experiments, and case studies in Guangzhou. The results show that MFTSformer reduces errors by up to 52.24%. It also outperforms the baselines in long-term carbon price prediction (by 15.3% on average) while requiring fewer training resources (it converges within 20 epochs). These findings suggest that MFTSformer can offer new insights into the use of AI in urban forest research.
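
To make the role of self-attention on a long price series concrete, the following is a minimal sketch of generic scaled dot-product self-attention in NumPy. It is not the MFTSformer implementation: the function name `self_attention`, the dimensions, and the random placeholder input are all assumptions made for illustration, and the abstract does not specify which attention variant the model uses.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Generic scaled dot-product self-attention over one sequence.

    x:   (T, d_model) array, one row per time step
    w_*: (d_model, d_k) projection matrices
    Returns the attended output (T, d_k) and the attention weights (T, T).
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])        # pairwise similarity of time steps
    scores -= scores.max(axis=-1, keepdims=True)   # stabilize the softmax numerically
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True) # each row sums to 1
    return weights @ v, weights

# Hypothetical sizes: 30 time steps, 8 input features, 4 attention dimensions.
rng = np.random.default_rng(0)
T, d_model, d_k = 30, 8, 4
x = rng.normal(size=(T, d_model))  # placeholder for embedded price features
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, attn = self_attention(x, w_q, w_k, w_v)
```

Each row of `attn` gives the weight one time step places on every other step in the window, so distant observations can influence a given step directly rather than through a chain of intermediate states.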