Deep learning (DL) has been introduced for automatic classification of heart abnormalities from ECG signals, but its application in clinical practice remains limited. A systematic review is performed from the perspectives of ECG database, preprocessing, DL methodology, evaluation paradigm, performance metric, and code availability to identify research trends, challenges, and opportunities for DL-based ECG arrhythmia classification. Specifically, 368 studies meeting the eligibility criteria are included. A total of 223 (61%) studies use the MIT-BIH Arrhythmia Database to design DL models. A total of 138 (38%) studies consider removing noise or artifacts from ECG signals, and 102 (28%) studies perform data augmentation to expand the minority arrhythmia categories. Convolutional neural networks are the dominant models, used in 216 (58.7%) of the reviewed studies, while a growing number of studies have integrated multiple DL structures in recent years. A total of 319 (86.7%) and 38 (10.3%) studies explicitly state that they use the intra- and inter-patient evaluation paradigms, respectively, and notable performance degradation is observed in the inter-patient paradigm. Compared to the overall accuracy, the average F1 score, sensitivity, and precision are significantly lower in the selected studies. To bring DL-based ECG classification into real clinical scenarios, future research could leverage diverse ECG databases, design advanced denoising and data augmentation techniques, integrate novel DL models, and investigate the inter-patient paradigm more deeply.
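Two points in the summary above can be illustrated with a small sketch: the inter-patient paradigm, in which training and test beats come from disjoint sets of patients, and the gap between overall accuracy and the average F1 score on imbalanced arrhythmia classes. The code below is a toy illustration only and is not taken from any reviewed study. The helper names (`inter_patient_split`, `macro_f1`), the patient IDs, and the 95/5 class ratio are assumptions chosen for the example; the labels "N" and "V" stand for normal and ventricular beats.

```python
def inter_patient_split(records, test_patients):
    """Split (patient_id, label) records so no patient appears on both sides."""
    train = [r for r in records if r[0] not in test_patients]
    test = [r for r in records if r[0] in test_patients]
    return train, test


def accuracy(y_true, y_pred):
    """Fraction of beats whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_f1(y_true, y_pred, label):
    """F1 score of one class, treating that class as the positive one."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    labels = sorted(set(y_true))
    return sum(class_f1(y_true, y_pred, c) for c in labels) / len(labels)


# Hypothetical records: (patient_id, beat_label).
records = [("p1", "N"), ("p1", "V"), ("p2", "N"), ("p3", "N"), ("p3", "V")]
train, test = inter_patient_split(records, test_patients={"p3"})
train_ids = {pid for pid, _ in train}
test_ids = {pid for pid, _ in test}

# Hypothetical imbalanced test set: 95 normal beats, 5 ventricular beats,
# scored against a degenerate classifier that always predicts "N".
y_true = ["N"] * 95 + ["V"] * 5
y_pred = ["N"] * 100

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.3f}, macro F1 = {mf1:.3f}")
```

In this toy case the always-normal classifier reaches 95% accuracy while its macro-averaged F1 is below 0.5, because the minority class contributes an F1 of zero. This is one plausible reason, under the stated assumptions, why class-averaged metrics can sit well below overall accuracy.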