Background: Cardiovascular diseases (CVDs) are a leading cause of death worldwide. Deep learning methods are widely used in medical image analysis and have shown promising results in the diagnosis of CVDs. Methods: Experiments were performed on 12-lead electrocardiogram (ECG) databases collected by Chapman University and Shaoxing People’s Hospital. The ECG signal of each lead was converted into a scalogram image and an ECG grayscale image, and these images were used to fine-tune a pretrained ResNet-50 model for each lead. The fine-tuned ResNet-50 models served as base learners for a stacking ensemble. Logistic regression, support vector machine, random forest, and XGBoost were each used as the meta learner, which combines the predictions of the base learners. The study introduces a multi-modal stacking ensemble, in which the meta learner is trained on base-learner predictions from two modalities: scalogram images and ECG grayscale images. Results: The multi-modal stacking ensemble that combined ResNet-50 base learners with a logistic regression meta learner achieved an AUC of 0.995, an accuracy of 93.97%, a sensitivity of 0.940, a precision of 0.937, and an F1-score of 0.936, all higher than those of LSTM, BiLSTM, the individual base learners, a simple averaging ensemble, and single-modal stacking ensembles. Conclusion: The proposed multi-modal stacking ensemble approach was effective for diagnosing CVDs.
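The multi-modal stacking step described in the Methods can be illustrated with a minimal sketch. It is not the authors' implementation: the ResNet-50 base learners are replaced by simulated class probabilities, the number of classes (`N_CLASSES = 4`) is an assumption because the abstract does not state it, and all function names (`simulate_base_probs`, `build_meta_features`, `fit_logistic_meta_learner`) are hypothetical. The meta learner is a small multinomial logistic regression written in NumPy, standing in for the logistic regression meta learner named above.

```python
import numpy as np

N_MODALITIES = 2   # scalogram and ECG grayscale images (from the abstract)
N_LEADS = 12       # one fine-tuned base learner per lead (from the abstract)
N_CLASSES = 4      # ASSUMPTION: the abstract does not state the class count


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def simulate_base_probs(y, rng, signal=2.0):
    """Hypothetical stand-in for ResNet-50 outputs.

    Returns class probabilities of shape (modalities, leads, samples, classes).
    """
    logits = rng.normal(size=(N_MODALITIES, N_LEADS, y.size, N_CLASSES))
    logits += signal * np.eye(N_CLASSES)[y]  # boost the true class logit
    return softmax(logits)


def build_meta_features(probs):
    """Concatenate every base learner's class probabilities per sample."""
    m, l, n, c = probs.shape
    return probs.transpose(2, 0, 1, 3).reshape(n, m * l * c)


def fit_logistic_meta_learner(X, y, n_classes, lr=0.1, epochs=500, l2=1e-3):
    """Multinomial logistic regression fit by full-batch gradient descent."""
    W = np.zeros((X.shape[1], n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]  # one-hot targets
    for _ in range(epochs):
        grad = (softmax(X @ W + b) - Y) / len(y)
        W -= lr * (X.T @ grad + l2 * W)
        b -= lr * grad.sum(axis=0)
    return W, b


def predict(X, W, b):
    return (X @ W + b).argmax(axis=1)


rng = np.random.default_rng(0)
y_train = rng.integers(0, N_CLASSES, size=400)
y_test = rng.integers(0, N_CLASSES, size=200)

# In a real stacking pipeline the training meta-features should be
# out-of-fold (or held-out) base-learner predictions, so the meta learner
# is not fit on outputs for data the base learners were trained on.
X_train = build_meta_features(simulate_base_probs(y_train, rng))
probs_test = simulate_base_probs(y_test, rng)
X_test = build_meta_features(probs_test)

W, b = fit_logistic_meta_learner(X_train, y_train, N_CLASSES)
stacking_acc = (predict(X_test, W, b) == y_test).mean()

# Simple averaging baseline: mean probability over all modalities and leads.
averaging_acc = (probs_test.mean(axis=(0, 1)).argmax(axis=1) == y_test).mean()
print(f"stacking: {stacking_acc:.3f}  simple averaging: {averaging_acc:.3f}")
```

Under these assumptions each sample contributes 2 × 12 × 4 = 96 meta-features. A single-modal stacking ensemble would correspond to passing only one slice of the first axis (for example `probs[:1]`) to `build_meta_features`. The synthetic numbers printed here say nothing about the reported results; they only show how the pieces fit together.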