To address the problems of insufficient network fusion and low detection efficiency in current RGB-D object recognition, we propose a recognition algorithm based on mid-level, layer-by-layer fusion of a dual-channel network. First, the RGB and depth networks are trained separately on ten labelled RGB-D indoor objects, and the fusion coefficients are determined from the recognition accuracy of the two networks. The two kinds of features are then merged step by step in the convolutional layers to obtain the final weights. On the challenging NYU Depth v2 dataset, our method achieves a recognition accuracy of 92.85% with an average detection time of 61.03 ms per image. Comparison experiments show that the average accuracy of the RGB-D layer-by-layer fusion network is 5.22% higher than that of the RGB network.
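
The abstract does not state how the fusion coefficients are derived from the two accuracies, so the following is only a minimal sketch under an assumed rule: each stream is weighted in proportion to its standalone recognition accuracy, and the same pair of coefficients is applied at every fused layer. The function names (`fusion_coefficients`, `fuse_layer`, `fuse_layer_by_layer`), the flat-list representation of features, and the accuracy values in the usage note are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple


def fusion_coefficients(acc_rgb: float, acc_depth: float) -> Tuple[float, float]:
    """Assumed rule (not from the paper): weights proportional to accuracy."""
    total = acc_rgb + acc_depth
    if total <= 0:
        raise ValueError("accuracies must sum to a positive value")
    return acc_rgb / total, acc_depth / total


def fuse_layer(
    rgb_feat: Sequence[float],
    depth_feat: Sequence[float],
    w_rgb: float,
    w_depth: float,
) -> List[float]:
    """Element-wise weighted sum of two same-sized feature vectors."""
    if len(rgb_feat) != len(depth_feat):
        raise ValueError("feature vectors must have the same length")
    return [w_rgb * r + w_depth * d for r, d in zip(rgb_feat, depth_feat)]


def fuse_layer_by_layer(
    rgb_layers: Sequence[Sequence[float]],
    depth_layers: Sequence[Sequence[float]],
    w_rgb: float,
    w_depth: float,
) -> List[List[float]]:
    """Apply the same weighted fusion to each pair of corresponding layers."""
    if len(rgb_layers) != len(depth_layers):
        raise ValueError("both streams must have the same number of layers")
    return [
        fuse_layer(r, d, w_rgb, w_depth)
        for r, d in zip(rgb_layers, depth_layers)
    ]
```

As a usage illustration with placeholder accuracies (not reported results), `fusion_coefficients(0.9, 0.6)` gives weights of roughly 0.6 for the RGB stream and 0.4 for the depth stream, so the more accurate stream contributes more to every fused layer. A real implementation would operate on convolutional feature maps or filter tensors rather than flat lists.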