Cardiovascular disease (CVD) is one of the leading causes of death globally. Clinical diagnosis of CVD currently relies primarily on electrocardiograms (ECGs), whose patterns are relatively easy to identify compared with those of other diagnostic methods. Reading ECGs accurately, however, requires specialized training for healthcare professionals. An ECG-based CVD diagnostic system can therefore provide preliminary diagnostic results, reducing the workload of healthcare staff and improving the accuracy of CVD diagnosis. In this study, a deep neural network that combines a cross-stage partial network with a cross-attention-based transformer is used to develop an ECG-based CVD decision system. To represent the characteristics of the ECG accurately, the cross-stage partial network is employed to extract embedding features; it captures and leverages partial information from different stages, which strengthens feature extraction. To distill these embedding features, a cross-attention-based transformer is then applied. Its scalability allows it to process data sequences of different lengths and complexities, so it can retain the most meaningful features and yield more accurate outcomes. In the experiments, the proposed approach achieved a challenge scoring metric of 0.6112, outperforming the compared methods. The proposed ECG-based CVD decision system is therefore useful for clinical diagnosis.
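As a rough illustration of why cross-attention copes with sequences of different lengths, the following minimal sketch applies scaled dot-product cross-attention in plain Python. It is not the authors' architecture: the names `cross_attention`, `latent_queries`, `short_seq`, and `long_seq` are hypothetical, the embedding features serve as both keys and values, and the learned projections and multiple heads of a real transformer are omitted for brevity.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, features):
    """Scaled dot-product cross-attention (illustrative sketch only).

    queries:  list of n_q vectors of dimension d (assumed latent tokens)
    features: list of n vectors of dimension d, used as keys AND values
    Returns n_q vectors of dimension d, regardless of n.
    """
    d = len(queries[0])
    scale = math.sqrt(d)
    outputs = []
    for q in queries:
        # Similarity of this query to every feature vector in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in features]
        weights = softmax(scores)
        # Weighted average of the feature vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, features)) for j in range(d)]
        )
    return outputs


# Hypothetical inputs: two query tokens and two feature sequences of unequal length.
latent_queries = [[1.0, 0.0], [0.0, 1.0]]
short_seq = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
long_seq = short_seq * 4  # 12 feature vectors instead of 3

pooled_short = cross_attention(latent_queries, short_seq)
pooled_long = cross_attention(latent_queries, long_seq)
print(len(pooled_short), len(pooled_long))
```

In this sketch the number of output vectors is set by the number of queries, not by the length of the input sequence, so inputs of different durations are reduced to a representation of the same size.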