Designing reliable and fast segmentation algorithms for ancient documents has been a topic of major interest for many libraries and a prime research issue in the document analysis community. We therefore propose in this article a fast enhancement and segmentation algorithm for ancient documents, based on Simple Linear Iterative Clustering (SLIC) superpixels and Gabor descriptors in a multi-scale approach. First, to obtain enhanced backgrounds of noisy ancient documents, a novel foreground/background segmentation algorithm based on SLIC superpixels is introduced. Once the SLIC technique has been carried out, the superpixels are classified as background or foreground. An enhanced, noise-free background is then obtained by processing the background superpixels. Subsequently, Gabor descriptors are extracted only from the selected foreground superpixels of the enhanced gray-level images of ancient book documents, using a multi-scale approach. Finally, to segment the document image, the foreground superpixels are clustered by partitioning the Gabor-based feature sets into compact and well-separated clusters in the feature space. The proposed algorithm does not assume any a priori information regarding document image content and structure and provides interesting results on a large corpus of ancient documents. Qualitative and numerical experiments are given to demonstrate the enhancement and segmentation quality.
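To make the multi-scale Gabor descriptor step more concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It builds the real part of a Gabor kernel and collects the absolute filter responses at one pixel over several wavelengths and orientations. The function names, the wavelength set, the number of orientations, the rule `sigma = 0.5 * wavelength`, and the aspect ratio `gamma` are all assumptions chosen for illustration; the abstract does not state these parameters.

```python
import math


def gabor_kernel(size, wavelength, theta, sigma, gamma=0.5):
    """Real part of a Gabor kernel as a size x size list of lists (size odd)."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            # Rotate the coordinates by the filter orientation theta.
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            # Gaussian envelope multiplied by a cosine carrier.
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            carrier = math.cos(2 * math.pi * xr / wavelength)
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def response_at(image, kernel, cy, cx):
    """Correlate the kernel with the image, centred at (cy, cx).

    Pixels outside the image are treated as zero.
    """
    half = len(kernel) // 2
    height, width = len(image), len(image[0])
    total = 0.0
    for ky, row in enumerate(kernel):
        for kx, weight in enumerate(row):
            y, x = cy + ky - half, cx + kx - half
            if 0 <= y < height and 0 <= x < width:
                total += weight * image[y][x]
    return total


def multiscale_features(image, cy, cx, wavelengths=(2.0, 4.0, 8.0), n_orientations=4):
    """Absolute Gabor responses at (cy, cx), one per (wavelength, orientation)."""
    features = []
    for wavelength in wavelengths:
        sigma = 0.5 * wavelength             # assumed scale rule, not from the paper
        size = 2 * int(math.ceil(3 * sigma)) + 1  # cover about three sigmas each side
        for i in range(n_orientations):
            theta = i * math.pi / n_orientations
            kernel = gabor_kernel(size, wavelength, theta, sigma)
            features.append(abs(response_at(image, kernel, cy, cx)))
    return features
```

In a pipeline like the one described, such a vector would presumably be computed for each foreground superpixel (for example at its centre, or averaged over its pixels) before clustering. A production version would normally use a vectorized image library rather than nested Python loops.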
To meet the objective of indexing and retrieving digitized resources and offering structured access to large sets of cultural heritage documents, there has been growing interest in historical document image segmentation. There is a real need for automatic algorithms that identify homogeneous regions, or groups of pixels sharing some visual characteristics, in historical documents (i.e. distinguishing graphic types, segmenting graphical regions from textual ones, and discriminating text in a variety of situations of different fonts and scales). Determining graphic regions can help to segment and analyze the graphical part of historical heritage documents, while finding text zones can be used as a pre-processing stage for character recognition, text line extraction, handwriting recognition, etc. Thus, we propose in this article an automatic segmentation method for historical document images based on the extraction of homogeneous or similar content regions. The proposed algorithm is based on simple linear iterative clustering (SLIC) superpixels, Gabor filters, multi-scale analysis, a majority voting technique, connected component analysis, color layer separation, and an adaptive run-length smoothing algorithm (ARLSA). It has been evaluated on 1000 pages of historical documents and achieved interesting results.
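As an illustration of the run-length smoothing idea, here is a minimal sketch of the classic, non-adaptive horizontal RLSA pass on a binary image (1 = foreground, 0 = background). It is not the adaptive variant (ARLSA) used in the article, whose rules are not given in the abstract. The function names and the fixed `max_gap` threshold are hypothetical, and filling only the gaps bounded by foreground on both sides is a design choice made here.

```python
def rlsa_row(row, max_gap):
    """Fill background runs of length <= max_gap lying between two foreground pixels."""
    out = list(row)
    last_fg = None  # index of the most recent foreground pixel seen so far
    for i, value in enumerate(row):
        if value:
            if last_fg is not None:
                gap = i - last_fg - 1
                if 0 < gap <= max_gap:
                    # Short gap: merge the two foreground pixels into one run.
                    for j in range(last_fg + 1, i):
                        out[j] = 1
            last_fg = i
    return out


def rlsa_horizontal(image, max_gap):
    """Apply the horizontal smoothing pass to every row of a binary image."""
    return [rlsa_row(row, max_gap) for row in image]
```

For example, `rlsa_row([1, 0, 0, 1, 0, 0, 0, 1, 0], 2)` fills the two-pixel gap but leaves the three-pixel gap and the trailing background untouched. Smoothing of this kind is typically used to merge neighbouring characters into word or line blobs before connected component analysis.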