The objective of document preprocessing is to ease text recognition or document indexing. The analysis of historical documents is challenging because most of these documents are noisy and present many degradations. In this paper we propose a preprocessing framework for a large dataset of historical documents. The proposed framework is composed of two phases: selection and evaluation. During the selection phase, one or more preprocessing methods are assigned to each book in the database. The selection results are then validated during the evaluation phase. The experiments are conducted on printed and handwritten documents extracted from the Google-Books and Bayerische Staatsbibliothek databases, respectively. The results obtained during the evaluation are very promising.
Data Warehouses and OLAP (On Line Analytical Processing) technologies are dedicated to analyzing structured data originating from organizations' OLTP (On Line Transaction Processing) systems. In addition, to enhance their decision support systems, these organizations need to explore XML (eXtensible Markup Language) documents as an additional and important source of unstructured data. In this context, this paper addresses the warehousing of document-centric XML documents. More specifically, we propose a two-method approach to building Document Warehouse conceptual schemas. The first method unifies XML document structures; it aims to elaborate a global and generic view for a set of XML documents belonging to the same domain. The second method designs multidimensional galaxy schemas for Document Warehouses.
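As a rough illustration of what unifying XML document structures into a global view could look like, the sketch below collects the root-to-element tag paths of several documents and merges them into one structure, recording how many documents use each path. This is not the method proposed in the paper, whose details the abstract does not give; the function names `element_paths` and `unify_structures` and the sample documents are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def element_paths(xml_text):
    """Return the set of root-to-element tag paths found in one XML document."""
    root = ET.fromstring(xml_text)
    paths = set()

    def walk(node, prefix):
        # Build the path of this element and recurse into its children.
        path = f"{prefix}/{node.tag}"
        paths.add(path)
        for child in node:
            walk(child, path)

    walk(root, "")
    return paths


def unify_structures(documents):
    """Merge the tag paths of all documents (hypothetical helper).

    Returns a dict mapping each path to the number of documents containing it.
    """
    counts = Counter()
    for xml_text in documents:
        counts.update(element_paths(xml_text))
    return dict(counts)


# Two made-up documents from the same domain with slightly different structures.
docs = [
    "<article><title>A</title><section><para>x</para></section></article>",
    "<article><title>B</title><abstract>y</abstract></article>",
]
unified = unify_structures(docs)
for path in sorted(unified):
    print(path, unified[path])
```

Paths shared by every document (here `/article` and `/article/title`) would be natural candidates for mandatory elements of a generic view, while the others could be treated as optional; that interpretation is an assumption of this sketch, not a claim about the paper's approach.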