Model interpretability and robustness are becoming increasingly critical today for the safe and practical deployment of deep learning (DL) models in industrial settings. As DL-backed automated document processing systems become increasingly common in business workflows, there is a pressing need today to enhance interpretability and robustness for the task of document image classification, an integral component of such systems. Surprisingly, while much research has been devoted to improving the performance of deep models for this task, little attention has been given to their interpretability and robustness. In this paper, we aim to improve upon both aspects and introduce DocXClassifier, an inherently interpretable deep document classifier that not only achieves significant performance improvements over existing approaches in image-based document classification, but also holds the capability to simultaneously generate feature importance maps while making its predictions. Our approach attains state-of-the-art performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.17% and 95.57%, respectively. Additionally, it sets a new record for the highest image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP, at 90.14%. In addition, our proposed training strategy demonstrates superior robustness compared to existing approaches, significantly outperforming them on 19 out of 21 different types of novel data distortions, while achieving comparable results on the remaining two. By combining robustness with interpretability, DocXClassifier presents a promising step towards the practical deployment of DL models for document classification tasks.