Siamese‐based trackers have demonstrated robust performance in object tracking, while Transformers have achieved widespread success in object detection. Many researchers currently design tracker backbones as a hybrid of convolutional neural networks (CNNs) and Transformers to improve performance. However, this approach often underutilises the global feature extraction capability of Transformers. The authors propose a novel Transformer‐based tracker, STFT, that fuses spatial and temporal features. The tracker consists of a multilayer spatial feature fusion network (MSFFN), a temporal feature fusion network (TFFN), and a prediction head. The MSFFN comprises two phases, feature extraction and feature fusion, and both phases are constructed with a Transformer. Compared with the hybrid “CNNs + Transformer” structure, the proposed method improves the continuity of feature extraction and the information interaction between features, enabling more comprehensive feature extraction. To account for the temporal dimension, the authors propose a TFFN for updating the template image. The network uses a Transformer to fuse the tracking results of multiple frames with the initial frame, so that the template image continuously incorporates new information while preserving accurate target features. Extensive experiments show that STFT achieves state‐of‐the‐art results on multiple benchmarks (OTB100, VOT2018, LaSOT, GOT‐10K, and UAV123). In particular, STFT achieves area under the curve (AUC) scores of 0.652 and 0.706 on the LaSOT and OTB100 benchmarks, respectively.
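For readers unfamiliar with the reported metric, the following is a minimal sketch of how a success-plot AUC score is commonly computed for single-object tracking. It assumes the widely used OTB-style protocol: boxes in (x, y, width, height) format, 21 evenly spaced overlap thresholds from 0 to 1, and a frame counted as successful when its overlap strictly exceeds the threshold. The function names `iou` and `success_auc` are illustrative, and the exact evaluation settings behind the scores above are not stated in this abstract.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def success_auc(pred_boxes, gt_boxes, n_thresholds=21):
    """Mean success rate over evenly spaced overlap thresholds in [0, 1].

    Assumption: a frame is a success when IoU > threshold (strict),
    as in the common OTB-style evaluation.
    """
    if len(pred_boxes) != len(gt_boxes) or not pred_boxes:
        raise ValueError("need two non-empty box lists of equal length")
    overlaps = [iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]
    thresholds = [i / (n_thresholds - 1) for i in range(n_thresholds)]
    # Fraction of frames whose overlap exceeds each threshold.
    success = [
        sum(o > t for o in overlaps) / len(overlaps) for t in thresholds
    ]
    # Averaging the success curve approximates the area under it.
    return sum(success) / len(success)
```

Under this convention, a tracker whose predictions match the ground truth exactly scores just below 1, because no overlap can strictly exceed the final threshold of 1.0.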