Background Accurately differentiating stable mild cognitive impairment (sMCI) from progressive MCI (pMCI) is clinically relevant, and identifying pMCI is crucial for timely treatment before it progresses to Alzheimer's disease (AD).
Objective To construct a convolutional neural network (CNN) model that differentiates pMCI from sMCI by integrating features from structural magnetic resonance imaging (sMRI) and positron emission tomography (PET) images.
Methods We proposed a multi-modal and multi-stage region-of-interest (ROI)-based fusion network (m2ROI-FN), a CNN model that differentiates pMCI from sMCI by adopting a multi-stage fusion strategy to integrate deep semantic features with multiple morphological metrics derived from ROIs of sMRI and PET images. Specifically, ten AD-related ROIs from each imaging modality were extracted as patches and fed into 3D hierarchical CNNs. The deep semantic features extracted by the CNNs were fused through a multi-modal integration module and further combined with multiple morphological metrics computed by FreeSurfer. Finally, a multilayer perceptron classifier was used for subject-level MCI classification.
Results The proposed model achieved an accuracy of 77.4% in differentiating pMCI from sMCI under 5-fold cross-validation on the entire ADNI database. In addition, ADNI-1 and ADNI-2 were combined into an independent sample for model training and validation, while ADNI-3 and ADNI-GO formed a second independent sample for multi-center testing. The model achieved 73.2% accuracy in distinguishing pMCI from sMCI on ADNI-1&2 and 75% accuracy on ADNI-3&GO.
Conclusions We proposed an effective m2ROI-FN model for distinguishing pMCI from sMCI that is capable of capturing distinctive features from ROIs of sMRI and PET images. The experimental results demonstrated that the model has the potential to differentiate pMCI from sMCI.
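
The sketch below illustrates the multi-stage fusion idea described in the Methods: per-modality 3D CNN branches over ROI patches, a fusion step that merges the deep sMRI and PET features, concatenation with FreeSurfer morphological metrics, and an MLP classifier. All layer sizes, the patch size, the number of morphological metrics, and the class/module names (ROIBranch, M2ROIFusionNet) are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal PyTorch sketch of a two-branch, multi-stage ROI fusion model.
# Assumed inputs: 10 ROI patches per modality per subject, plus a vector of
# FreeSurfer morphological metrics; all dimensions are hypothetical.
import torch
import torch.nn as nn


class ROIBranch(nn.Module):
    """Small 3D CNN applied to one modality's ROI patches (assumed 25^3 voxels)."""

    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d(2),
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),  # global pooling -> one vector per patch
        )
        self.proj = nn.Linear(32, out_dim)

    def forward(self, x):  # x: (batch * n_rois, 1, D, H, W)
        return self.proj(self.features(x).flatten(1))


class M2ROIFusionNet(nn.Module):
    """Hypothetical fusion model: sMRI + PET deep features + morphometrics -> MLP."""

    def __init__(self, n_rois: int = 10, feat_dim: int = 64, morph_dim: int = 20):
        super().__init__()
        self.smri_branch = ROIBranch(feat_dim)
        self.pet_branch = ROIBranch(feat_dim)
        # Stage 1: fuse the two modalities' deep semantic features.
        self.modal_fusion = nn.Sequential(
            nn.Linear(2 * n_rois * feat_dim, 256), nn.ReLU(),
        )
        # Stage 2: combine fused deep features with morphological metrics,
        # then classify pMCI vs. sMCI with a multilayer perceptron.
        self.classifier = nn.Sequential(
            nn.Linear(256 + morph_dim, 64), nn.ReLU(),
            nn.Linear(64, 2),
        )

    def forward(self, smri_patches, pet_patches, morph):
        # smri_patches / pet_patches: (batch, n_rois, 1, D, H, W); morph: (batch, morph_dim)
        b = smri_patches.size(0)
        s = self.smri_branch(smri_patches.flatten(0, 1)).view(b, -1)
        p = self.pet_branch(pet_patches.flatten(0, 1)).view(b, -1)
        fused = self.modal_fusion(torch.cat([s, p], dim=1))
        return self.classifier(torch.cat([fused, morph], dim=1))


if __name__ == "__main__":
    model = M2ROIFusionNet()
    smri = torch.randn(2, 10, 1, 25, 25, 25)   # 10 sMRI ROI patches per subject
    pet = torch.randn(2, 10, 1, 25, 25, 25)    # 10 PET ROI patches per subject
    morph = torch.randn(2, 20)                 # FreeSurfer morphological metrics
    print(model(smri, pet, morph).shape)       # -> torch.Size([2, 2])
```

The two fusion points mirror the "multi-stage" wording of the abstract: imaging modalities are merged first at the deep-feature level, and handcrafted morphometric features are injected only at the final classification stage.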