With the advancements in processing units and easy availability of cloud-based GPU servers, many deep learning-based methods have been proposed for Aspect Level Sentiment Classification (ALSC) literature. With this increase in the number of deep learning methods proposed in ALSC literature, it has become difficult to ascertain the performance difference of one method over the other. To this end, our study provides a statistical comparison of the performance of 35 recent deep learning methods with respect to three performance metrics-Accuracy, Macro F1 score, and Time. The methods are evaluated for eight benchmark datasets. In this study, the statistical comparison is based on Friedman, Nemenyi, and Wilcoxon tests. As per the results of statistical tests, the top-ranking methods could not significantly outperform several other methods in terms of Accuracy and Macro F1 score and performed poorly on-time metric. However, the time taken by any method is crucial to analyze the overall performance. Thus, this study aids the selection of the Deep Learning method, which maximizes the accuracy and Macro F1 score and takes minimal time. Our study also establishes a framework for validating the performance of new and alternate methods in ALSC that can be helpful for researchers and practitioners working in this area.