Video fire detection (VFD) technology has shown broad application prospects with the popularization of camera surveillance systems. Since the initial stage of a fire is the best time for firefighting, a robust algorithm for early warning is crucial. In this paper, an efficient VFD fusion algorithm is presented. First, fire candidate areas (FCAs) are located quickly using low-level visual features to ensure good timeliness. A multi-scale convolutional neural network with spatial pyramid pooling is then built and trained on a dedicated flame data set without requiring sample labeling, so that FCAs with different aspect ratios and scales can be identified accurately. The method is thoroughly tested on various databases of arbitrarily sized images and videos. Experimental results show that the proposed fusion algorithm not only improves detection efficiency but also ensures accurate identification of flames at different scales.
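The abstract does not specify which low-level visual features locate the fire candidate areas. As a minimal illustration only, the sketch below uses the classic RGB fire-color heuristic (R > G > B with R above a threshold); the rule, the threshold `r_min`, and the function names are assumptions, not the paper's method.

```python
# Hedged sketch: flag fire-colored pixels with a simple RGB rule.
# The rule and threshold are illustrative assumptions.

def is_fire_colored(r, g, b, r_min=150):
    """Return True if the pixel matches the assumed fire-color rule."""
    return r > g > b and r >= r_min

def fire_candidate_mask(image):
    """image: nested list of (r, g, b) tuples -> boolean mask of the
    same shape, marking fire-candidate pixels."""
    return [[is_fire_colored(*px) for px in row] for row in image]
```

In practice such a mask would be refined (e.g. with motion cues) before being passed to the classifier, but the cheap per-pixel test is what makes the candidate stage fast.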
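Spatial pyramid pooling is what lets the network accept candidate regions of different aspect ratios and scales: each region is max-pooled over a fixed grid of bins at several pyramid levels, producing a fixed-length vector regardless of input size. The sketch below illustrates the idea on a plain 2-D feature map; the pyramid levels (1, 2, 4) are an assumed example, not the paper's configuration.

```python
# Hedged sketch of spatial pyramid max pooling on one 2-D feature map.
# Output length depends only on `levels`, never on the input size:
# for levels (1, 2, 4) it is 1 + 4 + 16 = 21 values.

def spp_max_pool(feature_map, levels=(1, 2, 4)):
    """feature_map: non-empty list of equal-length rows of numbers.
    Returns the concatenated per-bin maxima for every pyramid level."""
    h, w = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for i in range(n):
            for j in range(n):
                # Split rows/cols into n roughly equal bins; guarantee
                # each bin covers at least one cell even when n > h or n > w.
                r0, r1 = i * h // n, max((i + 1) * h // n, i * h // n + 1)
                c0, c1 = j * w // n, max((j + 1) * w // n, j * w // n + 1)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, min(r1, h))
                                  for c in range(c0, min(c1, w))))
    return pooled
```

Because the vector length is fixed, the fully connected layers downstream never need the candidate regions to be warped to a single size.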