VLCDoC: Vision-Language contrastive pre-training model for cross-Modal document classification

Bakkali, Souhail; Ming, Zuheng; Coustaty, Mickaël; Rusiñol, Marçal; Terrades, Oriol Ramos

doi:10.1016/j.patcog.2023.109419

Cited by 18 publications

(3 citation statements)

References 8 publications

“…For instance, the Transformers maybe now used to provide end to end solutions and address various modalities related to document processing tasks, such as classification, question answering or NER [32], [70]. The diverse nature of documents necessitates multimodal reasoning that encompasses various types of inputs [8]. These inputs, including visual, textual, and layout elements, are found in a variety of document sources.…”

Section: F Turning To Efficient Solutions For Industry (mentioning)

confidence: 99%

An Overview of Data Extraction From Invoices

Saout, Lardeux, Saubion, 2024

IEEE Access

This paper provides a comprehensive overview of the process for information retrieval from invoices. Invoices serve as proof of purchase and contain important information, including the date, description, quantity, and the price of goods or services, as well as the terms of payment. Companies must process invoices quickly and accurately to maintain proper financial records. To automate this workflow, commercial systems have been developed. Despite the complexity involved, realizing automated processing of invoices necessitates the harmonious integration of a wide range of techniques and methods. While several surveys have shed light on different aspects of this workflow, our objective in this paper is to present a synthetic view of the process and emphasize the most pertinent challenges. We discuss the digitalization of invoices and the use of natural language processing techniques to extract relevant information. We also review machine learning and deep learning techniques that are widely used to handle the variability of layouts, minimize end-user tasks, and train and adapt to new contexts. The purpose of this overview is not to evaluate various systems and algorithms, but rather to propose a survey that reviews a wide scope of techniques for different data extraction tasks, addressing both information extraction and structure recognition for invoice processing. Specifically, we focus on table processing, paying particular attention to graph-based approaches.

“…11(b), it is easy to find that current large-scale PTMs are optimized on servers with more than 8 GPUs. Also, many of them are trained using more than 100 GPUs, such as BriVL (128) [103], VLC (128) [160], M6 (128) [100], SimVLM (512) [111], MURAL (512) [150], CLIP (256) [19], VATT (256) [162], Florence (512) [163], FILIP (192) [181]. Some MM-PTMs are trained on TPUs with massive chips, for example, the largest model of Flamingo [169] is trained for 15 days on 1 536 chips.…”

Section: Model Parameters and Training Information (mentioning)

confidence: 99%

Large-scale Multi-modal Pre-trained Models: A Comprehensive Survey

et al. 2023

With the urgent demand for generalized deep models, many pre-trained big models are proposed, such as bidirectional encoder representations (BERT), vision transformer (ViT), generative pre-trained transformers (GPT), etc. Inspired by the success of these models in single domains (like computer vision and natural language processing), the multi-modal pre-trained big models have also drawn more and more attention in recent years. In this work, we give a comprehensive survey of these models and hope this paper could provide new insights and helps fresh researchers to track the most cutting-edge works. Specifically, we firstly introduce the background of multi-modal pre-training by reviewing the conventional deep learning, pre-training works in natural language process, computer vision, and speech. Then, we introduce the task definition, key challenges, and advantages of multi-modal pre-training models (MM-PTMs), and discuss the MM-PTMs with a focus on data, objectives, network architectures, and knowledge enhanced pre-training. After that, we introduce the downstream tasks used for the validation of large-scale MM-PTMs, including generative, classification, and regression tasks. We also give visualization and analysis of the model parameters and results on representative downstream tasks. Finally, we point out possible research directions for this topic that may benefit future works. In addition, we maintain a continuously updated paper list for large-scale pre-trained multi-modal big models: https://github.com/wangxiao5791509/MultiModal_BigModels_Survey.

“…With growing research in vision-and-language and contrastive learning [28,18], recent research has focused on improving the performance and efficiency of VLC approaches. They propose new model architectures [24,2], better visual representation [7,27], loss function design [14,16], or sampling strategies [5,12]. However, these methods are still not suitable for variable-length reports and are inefficient in low-resource settings.…”

Section: Introduction (mentioning)

confidence: 99%

Enhancing Automatic Placenta Analysis Through Distributional Feature Recomposition in Vision-Language Contrastive Learning

Pan, Cai, Mehta et al. 2023

Lecture Notes in Computer Science

The placenta is a valuable organ that can aid in understanding adverse events during pregnancy and predicting issues postbirth. Manual pathological examination and report generation, however, are laborious and resource-intensive. Limitations in diagnostic accuracy and model efficiency have impeded previous attempts to automate placenta analysis. This study presents a novel framework for the automatic analysis of placenta images that aims to improve accuracy and efficiency. Building on previous vision-language contrastive learning (VLC) methods, we propose two enhancements, namely Pathology Report Feature Recomposition and Distributional Feature Recomposition, which increase representation robustness and mitigate feature suppression. In addition, we employ efficient neural networks as image encoders to achieve model compression and inference acceleration. Experiments validate that the proposed approach outperforms prior work in both performance and efficiency by significant margins. The benefits of our method, including enhanced efficacy and deployability, may have significant implications for reproductive healthcare, particularly in rural areas or low- and middle-income countries.
