2023
DOI: 10.1016/j.patcog.2023.109419

VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Cited by 18 publications (3 citation statements) | References 8 publications
“…For instance, Transformers may now be used to provide end-to-end solutions and address various modalities related to document processing tasks, such as classification, question answering, or NER [32], [70]. The diverse nature of documents necessitates multimodal reasoning that encompasses various types of inputs [8]. These inputs, including visual, textual, and layout elements, are found in a variety of document sources.…”
Section: F. Turning To Efficient Solutions For Industry (mentioning)
Confidence: 99%
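To make the kind of multimodal input this excerpt describes concrete, below is a minimal sketch (in PyTorch, not the cited paper's actual architecture) of a document encoder that projects textual tokens, layout bounding boxes, and visual patch features into one space and sums them before a shared Transformer. Every module name, dimension, and the sum-fusion choice is an illustrative assumption.

```python
import torch
import torch.nn as nn

class MultimodalDocEncoder(nn.Module):
    """Illustrative sketch only: fuses textual, visual, and layout inputs
    by summing per-token embeddings before a shared Transformer encoder
    (hypothetical; not the architecture of any cited paper)."""

    def __init__(self, vocab_size=30522, visual_dim=512, d_model=256, num_classes=16):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)    # textual input
        self.layout_proj = nn.Linear(4, d_model)              # (x0, y0, x1, y1) boxes
        self.visual_proj = nn.Linear(visual_dim, d_model)     # pre-extracted patch features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, token_ids, boxes, visual_feats):
        # Sum the three modality embeddings token-wise, then encode jointly.
        x = (self.token_emb(token_ids)
             + self.layout_proj(boxes)
             + self.visual_proj(visual_feats))
        h = self.encoder(x)                    # (batch, seq, d_model)
        return self.classifier(h.mean(dim=1)) # mean-pool -> document class logits

# Toy usage with random inputs: 2 documents, 64 tokens each.
model = MultimodalDocEncoder()
logits = model(torch.randint(0, 30522, (2, 64)),
               torch.rand(2, 64, 4),
               torch.randn(2, 64, 512))
print(logits.shape)  # torch.Size([2, 16])
```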
“…11(b), it is easy to see that current large-scale PTMs are optimized on servers with more than 8 GPUs. Also, many of them are trained using more than 100 GPUs, such as BriVL (128) [103], VLC (128) [160], M6 (128) [100], SimVLM (512) [111], MURAL (512) [150], CLIP (256) [19], VATT (256) [162], Florence (512) [163], FILIP (192) [181]. Some MM-PTMs are trained on TPUs with massive numbers of chips; for example, the largest model of Flamingo [169] is trained for 15 days on 1,536 chips.…”
Section: Model Parameters and Training Information (mentioning)
Confidence: 99%
“…With growing research in vision-and-language and contrastive learning [28,18], recent work has focused on improving the performance and efficiency of VLC approaches, proposing new model architectures [24,2], better visual representations [7,27], loss function designs [14,16], or sampling strategies [5,12]. However, these methods are still not suitable for variable-length reports and are inefficient in low-resource settings.…”
Section: Introduction (mentioning)
Confidence: 99%
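Since this excerpt singles out loss function design in vision-language contrastive (VLC) learning, here is a minimal sketch of the symmetric InfoNCE objective used by CLIP-style contrastive pre-training. The function name, temperature value, and embedding sizes are assumptions for illustration; this is the generic objective, not necessarily VLCDoC's exact loss.

```python
import torch
import torch.nn.functional as F

def symmetric_infonce(img_emb, txt_emb, temperature=0.07):
    """Generic CLIP-style contrastive loss (illustrative; hypothetical names).
    Matched image/text pairs lie on the diagonal of the similarity matrix."""
    img = F.normalize(img_emb, dim=-1)               # unit-norm embeddings
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature             # (B, B) cosine similarities
    targets = torch.arange(logits.size(0))           # i-th image matches i-th text
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

# Toy usage: a batch of 8 paired 256-d embeddings.
loss = symmetric_infonce(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```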