Fruit maturity is a critical factor in the supply chain, consumer preference, and agriculture industry. Most classification methods on fruit maturity identify only two classes: ripe and unripe, but this paper estimates six maturity stages of papaya fruit. Deep learning architectures have gained respect and brought breakthroughs in unimodal processing. This paper suggests a novel non-destructive and multimodal classification using deep convolutional neural networks that estimate fruit maturity by feature concatenation of data acquired from two imaging modes: visible-light and hyperspectral imaging systems. Morphological changes in the sample fruits can be easily measured with RGB images, while spectral signatures that provide high sensitivity and high correlation with the internal properties of fruits can be extracted from hyperspectral images with wavelength range in between 400 nm and 900 nm—factors that must be considered when building a model. This study further modified the architectures: AlexNet, VGG16, VGG19, ResNet50, ResNeXt50, MobileNet, and MobileNetV2 to utilize multimodal data cubes composed of RGB and hyperspectral data for sensitivity analyses. These multimodal variants can achieve up to 0.90 F1 scores and 1.45% top-2 error rate for the classification of six stages. Overall, taking advantage of multimodal input coupled with powerful deep convolutional neural network models can classify fruit maturity even at refined levels of six stages. This indicates that multimodal deep learning architectures and multimodal imaging have great potential for real-time in-field fruit maturity estimation that can help estimate optimal harvest time and other in-field industrial applications.