The combination of multi-modal image fusion schemes with deep learning classification methods, particularly Convolutional Neural Networks (CNNs), has achieved remarkable performance in the pedestrian detection field. In particular, the late fusion scheme has significantly enhanced the performance of the pedestrian recognition task. In this paper, the late fusion scheme combined with CNN learning is investigated in depth for pedestrian recognition on the Daimler stereo vision dataset. An independent CNN is used for each imaging modality (Intensity, Depth, and Optical Flow), and the CNNs' probabilistic output scores are then fused by a Multi-Layer Perceptron, which provides the recognition decision. We propose four different learning patterns based on Cross-Modality deep learning of Convolutional Neural Networks: (1) a Particular Cross-Modality Learning; (2) a Separate Cross-Modality Learning; (3) a Correlated Cross-Modality Learning; and (4) an Incremental Cross-Modality Learning model. Moreover, we design a new CNN architecture, called LeNet+, which improves the classification performance not only of each modality classifier but also of the multi-modality late-fusion scheme. Finally, we propose to train the LeNet+ model with the incremental cross-modality approach using optimal learning settings obtained through K-fold cross-validation. This method outperforms the state-of-the-art classifier provided with the Daimler datasets on both non-occluded and partially occluded pedestrian tasks.
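As an illustration of the late fusion scheme described above, the following minimal sketch concatenates the class-probability outputs of three per-modality classifiers and passes them through a small Multi-Layer Perceptron. It is not the implementation evaluated in this paper: the function name `late_fusion_mlp`, the hidden-layer size, the tanh activation, the two-class setting, and the random weights and scores are all assumptions chosen only to make the data flow concrete.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def late_fusion_mlp(scores, w1, b1, w2, b2):
    """Fuse per-modality class probabilities with a one-hidden-layer MLP.

    scores: list of arrays of shape (n_samples, n_classes), one per modality.
    Returns fused class probabilities of shape (n_samples, n_classes).
    """
    x = np.concatenate(scores, axis=1)   # (n_samples, n_modalities * n_classes)
    h = np.tanh(x @ w1 + b1)             # hidden layer (activation is an assumption)
    return softmax(h @ w2 + b2)

rng = np.random.default_rng(0)
n_samples, n_classes, n_hidden = 4, 2, 8   # illustrative sizes, not from the paper
modalities = ["intensity", "depth", "optical_flow"]

# Stand-ins for the three CNNs' probabilistic outputs (random, for illustration).
cnn_scores = [softmax(rng.normal(size=(n_samples, n_classes))) for _ in modalities]

# Untrained MLP weights; in practice these would be learned on the CNN scores.
w1 = rng.normal(scale=0.1, size=(len(modalities) * n_classes, n_hidden))
b1 = np.zeros(n_hidden)
w2 = rng.normal(scale=0.1, size=(n_hidden, n_classes))
b2 = np.zeros(n_classes)

fused = late_fusion_mlp(cnn_scores, w1, b1, w2, b2)
decision = fused.argmax(axis=1)   # recognition decision per sample
```

The key design point is that fusion operates on the classifiers' output scores rather than on raw images or intermediate features, so each modality's CNN can be trained independently before the MLP is fitted.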