In the twenty-first century, storing and managing digital documents has become commonplace across corporate and public sectors around the world. Physical documents are scanned in batches and stored in a digital archive as a heterogeneous document stream, referred to as a digital package. To facilitate Robotic Process Automation (RPA), the document stream must be automatically segmented into a set of independent, coherent multi-page documents by detecting the appropriate document boundaries. This is a common requirement of the Automated Document Management Systems (ADMS) of a Title Insurance (TI) company, where business operations are automated using RPA and the goal is to extract information from digital documents with minimal user intervention. The current study proposes and evaluates a multi-modal binary classification network that incorporates textual and visual features of digital document pages, and compares it against state-of-the-art baseline methodologies. Image and textual features are extracted in parallel from each input document image using a Visual Geometry Group 16 Convolutional Neural Network (VGG16-CNN) and a pre-trained Bidirectional Encoder Representations from Transformers model (Legal-BERT base), respectively, through transfer learning. Both feature sets are then fused and passed through the fully connected layers of a Multi-Layer Perceptron (MLP) to classify each page as either First Page (FP) or Other Page (OP). Real-time document image streams from a production business process archive were obtained from a reputed TI company for the study. The obtained F1 scores of 97.37% and 97.15% are significantly higher than those of the two baseline models considered and well above the expected Straight Through Pass (STP) threshold defined by the process admin.
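To make the link between page-level classification and stream segmentation concrete, the following is a minimal sketch, not the implementation used in the study. It assumes the classifier has already produced one FP/OP label per page, and it shows how those labels could be turned into document boundaries and how an F1 score could be computed for the FP class. The names `segment_stream` and `f1_score`, the 1/0 label encoding, and the toy label sequences are all hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

# Assumed label encoding (not taken from the study): 1 = First Page, 0 = Other Page.
FP, OP = 1, 0


def segment_stream(labels: Sequence[int]) -> List[Tuple[int, int]]:
    """Split a page stream into documents as (start, end) page indices, end inclusive.

    Every page labeled FP opens a new document. The first page of the stream
    always opens a document, even if the classifier labeled it OP.
    """
    documents: List[Tuple[int, int]] = []
    start = 0
    for i, label in enumerate(labels):
        if label == FP and i > 0:
            documents.append((start, i - 1))  # close the previous document
            start = i
    if len(labels) > 0:
        documents.append((start, len(labels) - 1))  # close the last document
    return documents


def f1_score(y_true: Sequence[int], y_pred: Sequence[int], positive: int = FP) -> float:
    """F1 score for one class, computed from true/false positives and false negatives."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy example: a six-page stream containing three documents.
true_labels = [FP, OP, OP, FP, FP, OP]
pred_labels = [FP, OP, FP, FP, OP, OP]  # hypothetical classifier output

print(segment_stream(true_labels))  # [(0, 2), (3, 3), (4, 5)]
print(segment_stream(pred_labels))  # [(0, 1), (2, 2), (3, 5)]
print(round(f1_score(true_labels, pred_labels), 4))  # 0.6667
```

In this toy stream, one false FP and one missed FP shift two of the three document boundaries, which illustrates why page-level errors matter for the downstream segmentation even when most pages are classified correctly.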