The purpose of this study is to reduce the burden of structuring documents. A technique is presented for extracting document architecture. As technical documents, 12,000 articles are taken from the proceedings of a national convention. As business documents, a summary of sample sentences and approximately 500 office documents from within the organization are examined. Rules for extracting the architecture are derived from these materials. From technical documents, the technique extracts hierarchical structures such as chapters and sections, as well as the reference structure that links the text to figures and tables. From business documents, it extracts the hierarchical structure of documents such as communications and reports. Technical and business documents can be distinguished by analyzing their character strings.
In an evaluation using proceedings and in-office documents other than those used to derive the rules, the error rate is 10.0 percent for technical documents and 23.0 percent for business documents. The error rate in extracting the reference structure is 8 percent. The method is then improved to handle equations, figures, and tables embedded in the text, and a field test is carried out. In the field test, the error rate is 5.4 percent for technical documents and 15.4 percent for business documents. Examples confirm that structuring takes considerably less time than manual processing. The technique has been commercialized as an automatic system that combines it with layout attributes. Beyond layout processing, the technique is expected to be useful for converting existing documents to hypertext and for other problems.
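As a rough illustration of rule-based extraction of section hierarchy and figure/table references, the following minimal sketch may help. It is not the rule set derived in the study, which is not given here. The heading pattern (a dotted section number followed by a title), the reference pattern ("Fig.", "Figure", "Table" followed by a number), and the names `Section` and `extract_structure` are all assumptions made for this example.

```python
import re
from dataclasses import dataclass, field

# Assumed heading convention: dotted section number, then a title,
# e.g. "2.1 Rule derivation". The study's actual rules are not stated.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\.?\s+(\S.*)$")
# Assumed reference convention: "Fig. 3", "Figure 3", or "Table 2".
REFERENCE = re.compile(r"\b(Fig\.|Figure|Table)\s*(\d+)")


@dataclass
class Section:
    number: str
    title: str
    children: list = field(default_factory=list)
    references: list = field(default_factory=list)


def extract_structure(lines):
    """Build a section tree and attach figure/table references to sections."""
    root = Section("", "document")
    stack = [root]  # stack[d] is the currently open section at depth d
    for line in lines:
        match = HEADING.match(line.strip())
        if match:
            number, title = match.groups()
            depth = number.count(".") + 1
            del stack[depth:]  # close sections at this depth or deeper
            node = Section(number, title)
            stack[-1].children.append(node)
            stack.append(node)
        else:
            # Body text: record references under the innermost open section.
            for kind, num in REFERENCE.findall(line):
                label = "Table" if kind == "Table" else "Figure"
                stack[-1].references.append(f"{label} {num}")
    return root


sample = [
    "1 Introduction",
    "See Fig. 1 for an overview.",
    "1.1 Background",
    "Table 2 lists the rules.",
    "2 Method",
    "Figure 3 shows the flow.",
]
tree = extract_structure(sample)
```

A pattern this simple would misread any body line that begins with a number followed by a space as a heading, which is one reason a practical system would need a larger rule set and, as the abstract notes, layout attributes.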