Structure extraction from PDF-based book documents

Gao, Liangcai; Tang, Zhi; Lin, Xiaofan; Ying, Liu; Qiu, Ruiheng; Wang, Yongtao

doi:10.1145/1998076.1998079

Cited by 25 publications

(19 citation statements)

References 25 publications

“…Such methods are brittle and may fail when presented with documents that follow new or different publishing styles. Gao et al [12] recently described SEB, a framework to detect the hierarchy and reading order of a document using weighted bipartite graphs but its fixed rules are again not flexible enough to capture document metadata in practical scenarios.…”

Section: Related Work (mentioning)

confidence: 99%

Extracting and matching authors and affiliations in scholarly documents

Nhat, Chandrasekaran, Cho et al. 2013

Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries

We introduce Enlil, an information extraction system that discovers the institutional affiliations of authors in scholarly papers. Enlil consists of two steps: one that first identifies authors and affiliations using a conditional random field; and a second support vector machine that connects authors to their affiliations. We benchmark Enlil in three separate experiments drawn from three different sources: the ACL Anthology, the ACM Digital Library, and a set of cross-disciplinary scientific journal articles acquired by querying Google Scholar. Against a state-of-the-art production baseline, Enlil reports a statistically significant improvement in F1 of nearly 10% (p ≪ 0.01). In the case of multidisciplinary articles from Google Scholar, Enlil is benchmarked over both clean input (F1 > 90%) and automatically-acquired input (F1 > 80%). We have deployed Enlil in a case study involving Asian genomics research publication patterns to understand how government sponsored collaborative links evolve. Enlil has enabled our team to construct and validate new metrics to quantify the facilitation of research as opposed to direct publication.

“…Systems by both Gao et al [12] and Lopez et al [21] correlate figures to their captions, enabling better document figure retrieval.…”

Section: Related Work (mentioning)

confidence: 99%

Extracting and matching authors and affiliations in scholarly documents

Nhat, Chandrasekaran, Cho et al. 2013

Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries

“…For example, Chaudhury et al [12] analyzed the layouts of scanned images of newspaper articles and automatically adds heading information. Also, Liangcai et al [13] proposed a system to automatically extract structures from a digital book format, PDF. These projects focus on automatically creating structures from scanned images or digital books.…”

Section: Structurization In Digitization (mentioning)

confidence: 99%

Transforming Japanese archives into accessible digital books

Ishihara, Itoko, Sato et al. 2012

Proceedings of the 12th ACM/IEEE-CS Joint Conference on Digital Libraries

Digitized physical books offer access to tremendous amounts of knowledge, even for people with print-related disabilities. Various projects and standard activities are underway to make all of our past and present books accessible. However digitizing books requires extensive human efforts such as correcting the results of OCR (optical character recognition) and adding structural information such as headings. Some Asian languages need extra efforts for the OCR errors because of their many and varied character sets. Japanese has used more than 10,000 characters compared with a few hundred in English. This heavy workload is inhibiting the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books. We reduce and disperse the human efforts and accelerate conversions by combining automatic inference and human capabilities. Our system preserves the original page images for the entire digitization process to support gradual refinement and distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to evaluate the required effort for digitizing books with a variety of layouts and years of publication. The results showed old Japanese books had specific problems when correcting the OCR errors and adding structures. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems.

“…Thus, it is now possible to find the references of a book during a book search. In addition, Thomson Reuters has released their Book Citation Index, giving researchers access to the citation network between books and the wider world of scholarly and scientific research and full bibliographies from books and book chapters. However, to the best of our knowledge, there are no academic search engines or digital libraries that provide a citation list of a book that enables navigation to the sources cited in a book, even though this facility is typically available for papers (for example, in CiteSeer [9]).…”

Section: Introduction (mentioning)

confidence: 99%

“…It is still a challenging problem to design a ToC recognition algorithm that can be effectively applied to large scale heterogeneous books [14]. Gao et al studied both ToC and metadata extraction from PDF book documents by modeling them as a matching problem on the bipartite graph [6]. Feng et al studied how to restructure the OCR output of books using a Hidden Markov Model (HMM) based hierarchical alignment algorithm [5].…”

mentioning

confidence: 99%

Searching online book documents and analyzing book citations

Das et al. 2013

Proceedings of the 2013 ACM Symposium on Document Engineering

Academic search engines and digital libraries provide convenient online search and access facilities for scientific publications. However, most existing systems do not include books in their collections although several books are freely available online. Academic books are different from papers in terms of their length, contents and structure. We argue that accounting for academic books is important in understanding and assessing scientific impact. We introduce an open-book search engine that extracts and indexes metadata, contents, and bibliography from online PDF book documents. To the best of our knowledge, no previous work gives a systematical study on building a search engine for books. We propose a hybrid approach for extracting title and authors from a book that combines results from CiteSeer, a rule based extractor, and a SVM based extractor, leveraging web knowledge. For "table of contents" recognition, we propose rules based on multiple regularities based on numbering and ordering. In addition, we study bibliography extraction and citation parsing for a large dataset of books. Finally, we use the multiple fields available in books to rank books in response to search queries. Our system can effectively extract metadata and contents from large collections of online books and provides efficient book search and retrieval facilities.
