Hierarchical logical structure extraction of book documents by analyzing tables of contents

2008

SPIE Proceedings

In this paper, we present a hybrid approach to splitting a book document into individual chapters. We use multiple sources of information to obtain a reliable assessment of the chapter title pages. These sources are produced by four methods: blank space detection, font analysis, header and footer association, and table of contents (TOC) analysis. Finally, a combination component scores potential chapter title pages and selects the best candidates. This approach takes full advantage of various kinds of information, such as page headers and footers, layout, and keywords. It works well even without TOC information, which is crucial for most previous work on this problem. Experiments show that this approach is robust and reliable.
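The combination step described in this abstract can be illustrated with a minimal sketch. The abstract does not say how the scores are combined, so the weighted average, the weight values, the threshold, and the names `WEIGHTS`, `combine_scores`, and `select_title_pages` are all assumptions made for illustration. The sketch renormalizes the weights over whichever sources are present, which is one way a page could still be scored when no TOC evidence is available.

```python
from typing import Dict, List

# Hypothetical weights for the four evidence sources named in the abstract.
WEIGHTS: Dict[str, float] = {
    "blank_space": 0.2,
    "font": 0.3,
    "header_footer": 0.2,
    "toc": 0.3,
}


def combine_scores(page_scores: Dict[str, float],
                   weights: Dict[str, float] = WEIGHTS) -> float:
    """Weighted average over the sources that are present for this page."""
    available = {name: w for name, w in weights.items() if name in page_scores}
    total = sum(available.values())
    if total == 0:
        return 0.0
    return sum(page_scores[name] * w for name, w in available.items()) / total


def select_title_pages(pages: Dict[int, Dict[str, float]],
                       threshold: float = 0.6) -> List[int]:
    """Return page numbers whose combined score reaches the threshold."""
    return sorted(p for p, scores in pages.items()
                  if combine_scores(scores) >= threshold)


if __name__ == "__main__":
    pages = {
        1: {"blank_space": 0.9, "font": 0.8, "header_footer": 0.7, "toc": 1.0},
        2: {"blank_space": 0.1, "font": 0.2, "header_footer": 0.0},
        15: {"blank_space": 0.8, "font": 0.9, "header_footer": 0.9},  # no TOC evidence
    }
    print(select_title_pages(pages))
```

In this toy input, page 15 has no TOC score but is still selected on the strength of the other three sources.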

Section: Methods 4: Text Matching (mentioning)

confidence: 92%

“…Feng et al [5] exploited the indentation, page numbers and numbering scheme to compute the logical structure of a book. Belaïd et al [6] proposed a labeling approach to recognize the TOC of scientific journal in the Calliope electronic library, extracted the page numbers from the TOC and used them to find the starting page of each article.…”

Section: Introduction (mentioning)

confidence: 99%

A mixed approach to book splitting

2008

SPIE Proceedings

“…The structures of page number, header, footer, headline, figure and body text are analyzed and matched with information on the contents pages to reconstruct the links between ToC and body text. He et al [17] propose a method to extract the hierarchical logical structure of book documents, along with the reference information, by combining the spatial and the semantic information of ToC in a book.…”

Section: Related Work (mentioning)

confidence: 99%

Structure extraction from PDF-based book documents

Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries

Lin et al. 2011

PDF documents have become a dominant knowledge repository for both academia and industry, largely because they are very convenient to print and exchange. However, methods for automated structure information extraction are yet to be fully explored, and the lack of effective methods hinders the reuse of information in PDF documents. To enhance the usability of PDF-formatted electronic books, we propose a novel computational framework to analyze the underlying physical structure and logical structure. The analysis is conducted at both page level and document level, including global typographies, reading order, logical elements, chapter/section hierarchy and metadata. Moreover, two characteristics of PDF-based books, i.e., style consistency across the whole book document and the natural rendering order of PDF files, are fully exploited in this paper to improve the conventional image-based structure extraction methods. This paper employs the bipartite graph as a common structure for modeling various tasks, including reading order recovery, figure and caption association, and metadata extraction. Based on the graph representation, the optimal matching (OM) method is utilized to find the global optima in those tasks. Extensive benchmarking using real-world data validates the high efficiency and discrimination ability of the proposed method.
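To make the bipartite formulation concrete, here is a small sketch of figure and caption association as a minimum-cost assignment. It is not the paper's OM implementation: the exhaustive search over permutations is a stand-in that is only practical for the handful of figures on one page, the Euclidean distance between box centers is an assumed cost, and `match_figures_to_captions` is a hypothetical name.

```python
from itertools import permutations
from math import dist
from typing import Dict, List, Tuple

Point = Tuple[float, float]  # (x, y) center of a bounding box, assumed units: points


def match_figures_to_captions(figures: List[Point],
                              captions: List[Point]) -> Dict[int, int]:
    """Return the figure -> caption assignment with minimum total distance.

    Figures and captions form the two sides of a bipartite graph; every
    one-to-one assignment is enumerated and the cheapest one is kept.
    """
    if len(figures) > len(captions):
        raise ValueError("need at least as many captions as figures")
    best_cost = float("inf")
    best: Tuple[int, ...] = ()
    for perm in permutations(range(len(captions)), len(figures)):
        cost = sum(dist(figures[i], captions[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best = cost, perm
    return dict(enumerate(best))


if __name__ == "__main__":
    figures = [(100.0, 200.0), (100.0, 600.0)]
    captions = [(100.0, 650.0), (100.0, 250.0)]
    print(match_figures_to_captions(figures, captions))
```

Scoring the whole assignment at once, rather than pairing each figure with its nearest caption independently, is what makes the result a global optimum for the chosen cost.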

“…Lin et al [4] introduced a system of TOC page analysis using layout modeling and headline matching, and acquired the logical structure of the TOC through in-depth analysis of its numbering scheme. He et al [5] combined geometrical rules (indentations) and semantic rules (typical text sequences identifying chapters and sections) to extract the hierarchical structure in Chinese books.…”

Section: Related Work (mentioning)

confidence: 99%
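The numbering-scheme cue mentioned in the citing statement above can be sketched as follows. This is an illustrative assumption rather than any cited system's method: it handles only dotted decimal numbering such as "2.3.1", and the pattern `TOC_LINE` and the function `parse_toc_line` are hypothetical names.

```python
import re
from typing import Optional, Tuple

# Dotted section number, title, optional dot leaders, trailing page number.
TOC_LINE = re.compile(
    r"^\s*(?P<num>\d+(?:\.\d+)*)\.?\s+(?P<title>.+?)[\s.]*(?P<page>\d+)\s*$"
)


def parse_toc_line(line: str) -> Optional[Tuple[int, str, int]]:
    """Return (depth, title, page) for a numbered TOC line, else None."""
    m = TOC_LINE.match(line)
    if m is None:
        return None
    depth = m.group("num").count(".") + 1  # "2.3.1" has depth 3
    return depth, m.group("title").strip(), int(m.group("page"))


if __name__ == "__main__":
    for line in ["1 Introduction 1",
                 "2.3.1 Font analysis ........ 47",
                 "Preface .... ix"]:
        print(parse_toc_line(line))
```

Unnumbered entries such as a preface return `None` here; a fuller system would fall back on other cues, such as indentation, for those lines.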

Analysis of book documents' table of content based on clustering

2009 10th International Conference on Document Analysis and Recognition

Lin et al. 2009

Table of contents (TOC) recognition has attracted a great deal of attention in recent years. After reviewing the merits and drawbacks of the existing TOC recognition methods, we have observed that book documents are multi-page documents with intrinsic local format consistency. Based on this finding we introduce an automatic TOC analysis method through clustering. This method first detects the decorative elements in TOC pages. Then it learns a layout model used in the TOC pages through clustering. Finally, it generates TOC entries and extracts their hierarchical structure under the guidance of the model. More specifically, broken lines are taken into account in the method. Experimental results show that this method achieves high accuracy and efficiency. In addition, this method has been successfully applied in a commercial E-book production software package.
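As a rough illustration of learning a layout model by clustering, the sketch below groups the left indents of TOC entries and treats each group as one hierarchy level. The abstract does not specify the features or the clustering algorithm, so the one-dimensional gap-based grouping, the `tolerance` value, and the name `indent_levels` are assumptions.

```python
from typing import List


def indent_levels(indents: List[float], tolerance: float = 5.0) -> List[int]:
    """Map each entry's left indent (in points) to a hierarchy level.

    Sorted indents are scanned once; a value more than `tolerance` away from
    the first value of the current cluster starts a new cluster. Each entry
    is then assigned the index of its nearest cluster (0 is the outermost).
    """
    centers: List[float] = []
    for x in sorted(set(indents)):
        if not centers or x - centers[-1] > tolerance:
            centers.append(x)
    return [min(range(len(centers)), key=lambda i: abs(x - centers[i]))
            for x in indents]


if __name__ == "__main__":
    print(indent_levels([72.0, 90.5, 89.8, 72.4, 108.0, 90.0]))
```

Because the clusters are derived from the book's own entries, the same code adapts to different indent widths without fixed thresholds per level, which is the kind of local format consistency the abstract relies on.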