Multi-stage feature fusion is an effective way to improve the tracking performance of deep Siamese trackers. However, conventional fusion approaches such as weighted averaging are too simple to combine features with diverse characteristics. In addition, the fusion module is generally optimized jointly with the Siamese network module, which may degrade the performance of the whole tracker. In this paper, we propose SiamRFL, a novel feature fusion network for Siamese trackers that exploits the expressive capacity of residual learning. Specifically, the network takes the deep-layer features as direct input to semantically distinguish the object from the background, and refines the object state with local detail patterns by passing the shallow-layer features through a residual channel. The classification and regression features are fused separately by deploying multiple fusion units. To avoid the degradation problem, we also present an ensemble training framework for our tracker, in which different loss functions are introduced to optimize the Siamese and fusion modules individually. In extensive experiments on several recent benchmarks, including OTB100, VOT2019, UAV123, LaSOT and GOT10k, the proposed tracker achieves state-of-the-art performance, outperforming other approaches by a clear margin.
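To make the contrast between weighted averaging and residual fusion concrete, the following is a minimal sketch and not the SiamRFL implementation. It assumes flattened feature maps of equal length and uses a hypothetical per-element affine transform (`gain`, `bias`) as a stand-in for the learned residual branch. The names `weighted_average` and `residual_fusion` are illustrative only.

```python
from typing import List

Feature = List[float]  # a flattened feature map, for illustration only


def weighted_average(features: List[Feature], weights: List[float]) -> Feature:
    """Baseline fusion: a fixed, normalized weighted sum of same-shaped features."""
    total = sum(weights)
    return [
        sum(w * f[i] for w, f in zip(weights, features)) / total
        for i in range(len(features[0]))
    ]


def residual_fusion(deep: Feature, shallow: Feature,
                    gain: Feature, bias: Feature) -> Feature:
    """Hypothetical fusion unit: the deep features pass through unchanged
    (direct input), and the shallow features contribute a correction
    through a residual branch (assumed here to be a per-element affine map)."""
    residual = [g * s + b for g, s, b in zip(gain, shallow, bias)]
    return [d + r for d, r in zip(deep, residual)]


deep = [1.0, 2.0]      # stand-in for deep-layer (semantic) features
shallow = [0.5, -1.0]  # stand-in for shallow-layer (detail) features

averaged = weighted_average([deep, shallow], [1.0, 1.0])
fused = residual_fusion(deep, shallow, gain=[2.0, 1.0], bias=[0.0, 0.5])
identity = residual_fusion(deep, shallow, gain=[0.0, 0.0], bias=[0.0, 0.0])
```

In this sketch, a residual branch that outputs zero leaves the deep features untouched, so the fusion unit can only add detail on top of the semantic representation. A weighted average, by contrast, always mixes the two inputs with fixed weights.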