Machine Learning vs Deterministic Rule-Based System for Document Stream Segmentation

Hamdi, Ahmed; Voerman, Joris; Coustaty, Mickaël; Joseph, Aurélie; d'Andecy, Vincent Poulain; Ogier, Jean-Marc

doi:10.1109/icdar.2017.332

Cited by 7 publications

(4 citation statements)

References 14 publications


“…A contextual and layout descriptor-based approach that represented the relationship of two consecutive pages of document stream was presented by Hamdi et al [5], [14]. In this approach, every page was represented with binary features of contextual and layout information, such as the textual fingerprint, ending signs, page number, dates, etc.…”

Section: Related Work (mentioning)

confidence: 99%
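
As a rough illustration of the kind of binary page-pair descriptors mentioned in the statement above, the following minimal sketch compares two consecutive pages. The `Page` fields, the `ENDING_SIGNS` list, and the three feature names are hypothetical choices made for this example; they are not the descriptors defined by Hamdi et al.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical closing formulas; a real system would use a curated list.
ENDING_SIGNS = ("sincerely", "regards")


@dataclass
class Page:
    text: str
    page_number: Optional[int] = None  # number printed on the page, if detected
    date: Optional[str] = None         # date string found on the page, if any


def pair_features(prev: Page, curr: Page) -> Dict[str, int]:
    """Return binary (0/1) descriptors relating two consecutive pages."""
    numbers_known = prev.page_number is not None and curr.page_number is not None
    return {
        # 1 if the current page number directly follows the previous one
        "page_number_follows": int(
            numbers_known and curr.page_number == prev.page_number + 1
        ),
        # 1 if both pages carry the same detected date
        "same_date": int(prev.date is not None and prev.date == curr.date),
        # 1 if the previous page contains a closing formula
        "prev_has_ending_sign": int(
            any(sign in prev.text.lower() for sign in ENDING_SIGNS)
        ),
    }
```

A vector of such flags per page pair is the sort of input a two-class classifier or a hand-written rule set could consume.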

“…A two-class classifier was trained using a decision tree to classify the pages into either a continuation or a break class where continuation class determines a page to be a continuation of the previous page, and break class determines the beginning of a new document. In a continuous effort to find the best approach, the authors compared the segmentation result using both rule-based and a machine learning-based approach to define the features and found the machine learning-based approach to produce better results than the rule-based approach [5].…”

Section: Related Work (mentioning)

confidence: 99%
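
To make the continuation/break formulation concrete, here is a minimal sketch of how per-page labels can be turned into document boundaries. It assumes the labels have already been produced by some classifier or rule set; the function name `split_stream` and the label strings are illustrative, not taken from the cited work.

```python
from typing import List


def split_stream(labels: List[str]) -> List[List[int]]:
    """Group page indices into documents from per-page labels.

    Each label is "break" (page starts a new document) or
    "continuation" (page continues the previous document).
    The first page always opens a document, whatever its label.
    """
    documents: List[List[int]] = []
    for index, label in enumerate(labels):
        if label == "break" or not documents:
            documents.append([index])      # start a new document
        else:
            documents[-1].append(index)    # extend the current document
    return documents


pages = ["break", "continuation", "break", "continuation", "continuation"]
print(split_stream(pages))  # [[0, 1], [2, 3, 4]]
```

Whether the labels come from a decision tree or from deterministic rules, the segmentation step itself stays the same; only the labelling quality differs.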

“…We have traced the evolution (Table 1) of the DSS technologies starting from the stochastic Markov chain model in 2009 [3] through the deep image-based page feature extraction and classification in 2016 [4], rule-based approach in 2017 [5] to a more sophisticated state-of-the-art multi-modal deep learning approach combining text and image features of the document page until 2021 [6]. Although, there is a clear dearth of the overall study observed in the domain of DSS, the recent breakthrough in the domain of DSS by Wiedemann and Heyer [7] shows promising results with Tobacco800 public data set and a proprietary data set.…”

Section: Introduction (mentioning)

confidence: 99%


A Multi-Modal Approach to Digital Document Stream Segmentation for Title Insurance Domain

et al. 2022


In the twenty-first century, storing and managing digital documents has become commonplace for all corporate and public sectors around the world. Physical documents are scanned in batches and stored in a digital archive as a heterogeneous document stream, referred to as a digital package. To make Robotic Process Automation (RPA) easier, it is necessary to automatically segment the document stream into a subset of independent, coherent multi-page documents by detecting the appropriate document boundary. This is a common requirement of a Title Insurance (TI) company's Automated Document Management Systems (ADMS), where business operations are automated using RPA and the goal is to extract information from digital documents with minimal user intervention. The current study proposes and evaluates a multi-modal binary classification network incorporating text and picture aspects of digital document pages, and compares it to state-of-the-art baseline methodologies. Image and textual features are extracted simultaneously from the input document image by passing it through a Visual Geometry Group 16 Convolutional Neural Network (VGG16-CNN) and a pre-trained Bidirectional Encoder Representations from Transformers (Legal-BERT base) model, respectively, through transfer learning. Both features are finally fused and passed through a fully connected layer of a Multi-Layer Perceptron (MLP) to obtain the binary classification of the pages as First Page (FP) or Other Page (OP). Real-time document image streams from a production business process archive were obtained from a reputed TI company for the study. The obtained F1 scores of 97.37% and 97.15% are significantly higher than the accuracies of the two baseline models considered and well above the expected Straight Through Pass (STP) threshold defined by the process admin.
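
For readers unfamiliar with the metric reported in this abstract, the sketch below computes an F1 score for the First Page (FP) class from true and predicted page labels. The label lists are made-up toy data, not results from the study, and the function is a plain reference implementation rather than the authors' evaluation code.

```python
from typing import Sequence


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "FP") -> float:
    """F1 = harmonic mean of precision and recall for the `positive` class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)  # true positives
    fp = sum(t != positive and p == positive for t, p in pairs)  # false positives
    fn = sum(t == positive and p != positive for t, p in pairs)  # false negatives
    if tp == 0:
        return 0.0  # no correct positive prediction: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: one Other Page (OP) is wrongly predicted as a First Page (FP).
truth = ["FP", "OP", "OP", "FP", "OP"]
predicted = ["FP", "OP", "FP", "FP", "OP"]
print(round(f1_score(truth, predicted), 3))  # 0.8
```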


“…These descriptors can be section numbers, page numbers, dates, salutation and conclusion formulas. The technique in [Hamdi et al, 2017] and [Hamdi et al, 2018], uses Doc2Vec model to realize the segmentation task. At first, the Doc2Vec is trained to learn the documents pages representation.…”

Section: Related Work (mentioning)

confidence: 99%

Use of Language Models for Document Stream Segmentation

Neche, Belaïd, 2020

Proceedings of the 9th International Conference on Pattern Recognition Applications and Methods


Page stream segmentation into single documents is a very common task in companies and administrations that process incoming mail. It is not a straightforward task because the limits of the documents are not always obvious, and it is not always easy to find common features between the pages of the same document. In this paper, we compare existing segmentation models and propose a new segmentation model based on GRUs (Gated Recurrent Units) and an attention mechanism, named AGRU. This model uses the text content of the previous page and the current page to determine whether both pages belong to the same document. Thanks to its attention mechanism, the model is capable of recognizing words that define the first page of a document. Training and evaluation are carried out on two datasets: Tobacco-800 and READ-Corpus. The former is a public dataset on which our model reaches an F1 score of 90%, and the latter is a private dataset on which our model reaches an F1 score of 96%.


Segmentation and Classification of Pages for Digitized Documents of the Public Prosecutor’s Office

Rivera, Quintanilla, Espezua, 2023

Communications in Computer and Information Science
