Remote sensing image fusion (also known as pan-sharpening) aims at generating high resolution multi-spectral (MS) image from inputs of a high spatial resolution single band panchromatic (PAN) image and a low spatial resolution multi-spectral image. Inspired by the astounding achievements of convolutional neural networks (CNNs) in a variety of computer vision tasks, in this paper, we propose a two-stream fusion network (TFNet) to address the problem of pan-sharpening. Unlike previous CNN based methods that consider pan-sharpening as a super resolution problem and perform pan-sharpening in pixel level, the proposed TFNet aims to fuse PAN and MS images in feature level and reconstruct the pan-sharpened image from the fused features. The TFNet mainly consists of three parts. The first part is comprised of two networks extracting features from PAN and MS images, respectively. The subsequent network fuses them together to form compact features that represent both spatial and spectral information of PAN and MS images, simultaneously. Finally, the desired high spatial resolution MS image is recovered from the fused features through an image reconstruction network. Experiments on Quickbird and GaoFen-1 satellite images demonstrate that the proposed TFNet can fuse PAN and MS images, effectively, and produce pan-sharpened images competitive with even superior to state of the arts.