This paper describes a system for the analysis and automatic indexing of imaged documents in high-volume applications. The system, named STRETCH (STorage and RETrieval by Content of imaged documents), is based on an Archiving and Retrieval Engine that overcomes the bottleneck of document profiling by bypassing some limitations of existing pre-defined indexing schemes. The engine exploits a structured document representation and can activate appropriate methods to characterise and automatically index heterogeneous documents with variable layout. The originality of STRETCH lies principally in allowing unskilled users to define the indexes relevant to their document domains of interest simply by presenting visual examples, and then applying reliable automatic information extraction methods (document classification, flexible reading strategies) to index the documents automatically, thus creating archives as desired. STRETCH offers ease of use and of application programming, and the ability to adapt dynamically to new types of documents. The system has been tested in two applications in particular, one concerning passive invoices and the other bank documents, each involving several classes of documents. The indexing strategy first classifies the document automatically, thus avoiding pre-sorting, then locates and reads the information pertaining to the specific document class. Experimental results are encouraging overall; in particular, document classification results fulfill the requirements of high-volume applications. Integration into production lines is under way.
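The two-stage indexing strategy (classify first, then read the fields specific to the recognised class) can be illustrated with a minimal sketch. It works on already-recognised text and uses keyword counting and regular expressions as simple stand-ins; the class names, cue words, field patterns, and function names are all hypothetical and are not taken from STRETCH, whose classification and reading methods are not detailed here.

```python
import re

# Hypothetical class definitions (assumption, not STRETCH's actual schema):
# each document class lists cue words for classification and regex patterns
# for reading its index fields.
CLASS_DEFINITIONS = {
    "invoice": {
        "cues": ["invoice", "vat", "total due"],
        "fields": {
            "invoice_number": r"invoice\s+no\.?\s*(\w+)",
            "total": r"total\s+due\s*:?\s*([\d.,]+)",
        },
    },
    "bank_statement": {
        "cues": ["account", "balance", "statement"],
        "fields": {
            "account_number": r"account\s+no\.?\s*(\w+)",
            "balance": r"balance\s*:?\s*([\d.,]+)",
        },
    },
}


def classify(text):
    """Stage 1: pick the class whose cue words occur most often in the text.

    Returns None when no cue word of any class is present.
    """
    lowered = text.lower()
    scores = {
        name: sum(cue in lowered for cue in definition["cues"])
        for name, definition in CLASS_DEFINITIONS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def index_document(text):
    """Stage 2: read only the fields defined for the recognised class."""
    doc_class = classify(text)
    if doc_class is None:
        return {"class": None, "fields": {}}
    fields = {}
    for name, pattern in CLASS_DEFINITIONS[doc_class]["fields"].items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            fields[name] = match.group(1)
    return {"class": doc_class, "fields": fields}


if __name__ == "__main__":
    sample = "ACME Ltd\nInvoice no. A1234\nVAT 20%\nTotal due: 1,250.00\n"
    print(index_document(sample))
```

Because the class is decided before any field is read, documents do not need to be pre-sorted, and supporting a new document type in this sketch only means adding one more entry to the hypothetical `CLASS_DEFINITIONS` table.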
This paper describes a system for storing and retrieving imaged multimedia documents by content. The system is being developed within the Esprit project STRETCH (STorage and RETrieval by Content of imaged documents). The core of the STRETCH system is a powerful Archiving and Retrieval Engine, based on a structured document representation and capable of activating appropriate methods to characterise and automatically index heterogeneous documents with variable layout and subsequently retrieve them in answer to complex queries. The resulting document base, or "Docu-base", relies on an object-oriented internal representation and related characterisation and search methods. A prototype has been implemented and successfully tested, in particular in the creation of an invoice archive.