<p>Convolutional Neural Networks (ConvNets) have been thoroughly researched for document image classification and are known for their exceptional performance in unimodal image-based document classification. Recently, however, the field has shifted sharply towards multimodal approaches that learn simultaneously from the visual and textual features of documents. While this has led to significant advances, it has also caused waning interest in improving pure ConvNet-based approaches. This is undesirable, as many multimodal approaches still use ConvNets as their visual backbone, so improving ConvNets remains essential to improving these approaches. In this paper, we present DocXClassifier, a ConvNet-based approach that, using state-of-the-art model design patterns together with modern data augmentation and training strategies, not only achieves significant performance improvements in image-based document classification but also outperforms some recently proposed multimodal approaches. Moreover, DocXClassifier is capable of generating transformer-like attention maps, which makes it inherently interpretable, a property not found in previous image-based classification models. Our approach achieves a new peak performance in image-based classification on two popular document datasets, RVL-CDIP and Tobacco3482, with top-1 classification accuracies of 94.17% and 95.57%, respectively. It also sets a new record of 90.14% image-based classification accuracy on Tobacco3482 without transfer learning from RVL-CDIP. Finally, our proposed model may serve as a powerful visual backbone for future multimodal approaches by providing much richer visual features than existing counterparts.</p>