Proceedings of the 11th Annual International ACM/IEEE Joint Conference on Digital Libraries 2011
DOI: 10.1145/1998076.1998079

Structure extraction from PDF-based book documents

Abstract: Nowadays PDF documents have become a dominating knowledge repository for both the academia and industry largely because they are very convenient to print and exchange. However, the methods of automated structure information extraction are yet to be fully explored and the lack of effective methods hinders the information reuse of the PDF documents. To enhance the usability for PDF-formatted electronic books, we propose a novel computational framework to analyze the underlying physical structure and logical stru…

Cited by 25 publications (19 citation statements)
References: 25 publications
“…Such methods are brittle and may fail when presented with documents that follow new or different publishing styles. Gao et al [12] recently described SEB, a framework to detect the hierarchy and reading order of a document using weighted bipartite graphs but its fixed rules are again not flexible enough to capture document metadata in practical scenarios.…”
Section: Related Work; type: mentioning
confidence: 99%
“…Systems by both Gao et al [12] and Lopez et al [21] correlate figures to their captions, enabling better document figure retrieval.…”
Section: Related Work; type: mentioning
confidence: 99%
“…For example, Chaudhury et al [12] analyzed the layouts of scanned images of newspaper articles and automatically adds heading information. Also, Liangcai et al [13] proposed a system to automatically extract structures from a digital book format, PDF. These projects focus on automatically creating structures from scanned images or digital books.…”
Section: Structurization In Digitization; type: mentioning
confidence: 99%
“…Thus, it is now possible to find the references of a book during a book search. In addition, Thomson Reuters has released their Book Citation Index, giving researchers access to the citation network between books and the wider world of scholarly and scientific research and full bibliographies from books and book chapters 6 . However, to the best of our knowledge, there are no academic search engines or digital libraries that provide a citation list of a book that enables navigation to the sources cited in a book, even though this facility is typically available for papers (for example, in CiteSeer [9]).…”
Section: Introduction; type: mentioning
confidence: 99%
“…It is still a challenging problem to design a ToC recognition algorithm that can be effectively applied to large scale heterogeneous books [14]. Gao et al studied both ToC and metadata extraction from PDF book documents by modeling them as a matching problem on the bipartite graph [6]. Feng et al studied how to restructure the OCR output of books using a Hidden Markov Model (HMM) based hierarchical alignment algorithm [5].…”
Type: mentioning
confidence: 99%
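
Several of the statements above describe the cited paper as modeling ToC and metadata extraction as a matching problem on a weighted bipartite graph. The following is a minimal sketch of that general idea only, not the method from the paper: ToC entries form one side of the graph, candidate headings found in the book body form the other, and the best one-to-one assignment maximizes total edge weight. The function names, the string-similarity edge weight, and the sample data are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from itertools import permutations


def similarity(a: str, b: str) -> float:
    """Hypothetical edge weight: case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_toc_to_headings(toc_entries, headings):
    """Return the maximum-weight one-to-one assignment of ToC entries to headings.

    Brute force over all assignments, so only suitable for a handful of items.
    Assumes len(toc_entries) <= len(headings).
    """
    best_score, best_assignment = -1.0, ()
    for perm in permutations(range(len(headings)), len(toc_entries)):
        score = sum(
            similarity(toc_entries[i], headings[j]) for i, j in enumerate(perm)
        )
        if score > best_score:
            best_score, best_assignment = score, perm
    return {toc_entries[i]: headings[j] for i, j in enumerate(best_assignment)}


# Illustrative data only: three ToC lines and four headings detected in the body.
toc = ["1 Introduction", "2 Related Work", "3 Method"]
page_headings = ["Related work", "Method", "Introduction", "References"]
matches = match_toc_to_headings(toc, page_headings)
```

The exhaustive search keeps the sketch dependency-free. For realistic book sizes an assignment solver such as SciPy's `linear_sum_assignment` would be the usual choice, and the edge weights would likely combine text similarity with layout cues such as page numbers and font size.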