2021
DOI: 10.1007/978-3-030-92185-9_26
Document Image Classification Method Based on Graph Convolutional Network

Cited by 8 publications (12 citation statements)
References 18 publications
“…As can be seen from the table, our best-performing model DocXClassifier-XL achieved 94.17% accuracy on the RVL-CDIP dataset, outperforming all previous image-based methods by a significant margin of +1.86%. It is interesting to note that even our lightest variant DocXClassifier-B achieved a comparable accuracy of 94.00%, and performed significantly better than all existing image-based models as well as some of the more sophisticated multimodal approaches [35], [46], [47], thus representing a good trade-off between accuracy and computational cost. It is important to note that two of the best-performing multimodal solutions, those of Kanchi et al. (2022) [48] and Bakkali et al. (2020) [17], simply combined ConvNet-based visual backbones (EfficientNet and NasNet, respectively) with a Transformer-based textual backbone (BERT) to achieve extraordinary improvements in document classification.…”
Section: B. Overall Evaluation (mentioning)
confidence: 92%
“…Recently, there has been an increased emphasis on multimodal classification techniques [13], [34], [35], in which document images are preprocessed to extract the textual content using stand-alone OCR software, and then visual, textual, and other layout features are used together for classification. Initial work in this area focused mainly on generating textual and visual embeddings using two separate deep network streams [12], [13] and then integrating them into a single embedding for final classification.…”
Section: Related Work, A. Document Image Classification (mentioning)
confidence: 99%
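The two-stream multimodal pipeline described in the statement above (OCR text extraction, separate visual and textual encoders, fusion into a single embedding) can be sketched as follows. This is a minimal illustrative sketch, assuming a ResNet-style visual backbone and a simple bag-of-embeddings stand-in for a BERT-like text encoder; the class name, dimensions, and 16-class output are placeholders and are not taken from the cited papers.

    # Sketch of a two-stream document classifier: one stream encodes the page
    # image, the other encodes OCR-extracted tokens, and the two embeddings are
    # concatenated into a single vector for classification.
    # All names and sizes are illustrative assumptions, not the cited method.
    import torch
    import torch.nn as nn
    from torchvision.models import resnet50

    class TwoStreamDocClassifier(nn.Module):
        def __init__(self, vocab_size=30522, text_dim=256, num_classes=16):
            super().__init__()
            # Visual stream: ConvNet backbone with the final classifier removed.
            backbone = resnet50(weights=None)
            self.visual = nn.Sequential(*list(backbone.children())[:-1])  # -> (B, 2048, 1, 1)
            # Textual stream: mean-pooled token embeddings over the OCR output.
            self.text = nn.EmbeddingBag(vocab_size, text_dim, mode="mean")
            # Fusion: concatenate both embeddings, then classify.
            self.head = nn.Linear(2048 + text_dim, num_classes)

        def forward(self, image, token_ids, offsets):
            v = self.visual(image).flatten(1)           # (B, 2048)
            t = self.text(token_ids, offsets)           # (B, text_dim)
            return self.head(torch.cat([v, t], dim=1))  # (B, num_classes)

Late fusion by concatenation is only one option; the statement notes that later work moved toward tighter integration of the visual, textual, and layout features.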
“…However, both approaches require pre-training with large amounts of document data. In a slightly different direction, Graph ConvNets [36] have also been recently explored for multimodal classification and show promising results.…”
Section: Related Work, A. Document Image Classification (mentioning)
confidence: 99%
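For the graph-convolutional direction mentioned in the statement above, a single propagation step over a document graph (for example, nodes as OCR text blocks and edges as spatial or layout relations) might look like the following. This is a generic Kipf-and-Welling style GCN layer given as a hedged sketch; it is not the architecture of the cited paper, and the pooling step at the end is an assumption.

    # Generic graph-convolution step over a document graph:
    # H' = relu(D^-1/2 (A + I) D^-1/2 H W). Illustrative only.
    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        def __init__(self, in_dim, out_dim):
            super().__init__()
            self.weight = nn.Linear(in_dim, out_dim, bias=False)

        def forward(self, node_feats, adj):
            # adj: (N, N) 0/1 adjacency matrix of the document graph.
            a_hat = adj + torch.eye(adj.size(0), device=adj.device)  # add self-loops
            deg = a_hat.sum(dim=1)
            d_inv_sqrt = torch.diag(deg.pow(-0.5))
            norm_adj = d_inv_sqrt @ a_hat @ d_inv_sqrt               # D^-1/2 (A+I) D^-1/2
            return torch.relu(norm_adj @ self.weight(node_feats))

    # A document-level prediction could then pool the node embeddings,
    # e.g. node_embeds.mean(dim=0), before a linear classifier.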