An Automatic System for Extracting Figures and Captions in Biomedical PDF Documents

Lopez, Luis D.; Yu, Jingyi; Arighi, Cecilia N.; Huang, Hongzhan; Shatkay, Hagit; Wu, Cathy H.

doi:10.1109/bibm.2011.26

Cited by 14 publications

(11 citation statements)

References 11 publications

“…Our PDFBox based extractor (E_pdbx) and 2. Xpdf based extractor (E_xpdf) reported in [4]. Results in table 1 shows that our system performs well in precision and recall.…”

Section: Experiments and Results (mentioning)

confidence: 51%

A figure search engine architecture for a chemistry digital library

Choudhury

Tuarob

Mitra

et al. 2013

Proceedings of the 13th ACM/IEEE-CS Joint Conference on Digital Libraries

Academic papers contain multiple figures representing important findings and experimental results; we present a search engine specifically focused on figures in academic documents. This search engine allows users to search on figures in approximately 150,000 chemistry journal articles, though the method is easily extendable to other domains. Our system indexes figure captions and mentions extracted from the PDF documents using a custom-built extractor. Recall and precision of the extracted figures are in the 80 to 90% range. We give the framework for the extraction algorithm, architecture, and ranking function.

“…Recent work [4] describes a methodology for extraction of images and captions from PDF files, whereby images are extracted from PDF using Xpdf and captions are extracted using regular expressions and heuristics. We use regular expressions and document layout information for the same task (section 4).…”

Section: Related Work (mentioning)

confidence: 99%
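
The regular-expression approach to caption detection mentioned in the statement above can be illustrated with a minimal sketch. This is not the method of either cited system: the pattern `CAPTION_RE` and the function name `find_caption_starts` are assumptions made for illustration, and the input is assumed to be text lines already extracted from a PDF page.

```python
import re

# Hypothetical pattern: a line that begins with "Figure 3.", "Fig. 3:", "FIG 3", etc.
# Real systems would combine this with layout cues; this is only a heuristic.
CAPTION_RE = re.compile(
    r"^\s*(?:Figure|Fig\.?)\s*(\d+)\s*[.:]?\s*(.*)$",
    re.IGNORECASE,
)


def find_caption_starts(lines):
    """Return (figure_number, caption_text) for each line that looks like a caption start."""
    found = []
    for line in lines:
        match = CAPTION_RE.match(line)
        if match:
            found.append((int(match.group(1)), match.group(2).strip()))
    return found


sample_lines = [
    "Figure 1. Western blot of X.",
    "as shown in Figure 2, the",   # in-text mention, not at line start
    "Fig. 3: Workflow",
]
captions = find_caption_starts(sample_lines)
```

Anchoring the pattern at the start of the line filters out most in-text mentions, but a body sentence that happens to begin with "Figure 2 shows" would still match, which is one reason the quoted work pairs regular expressions with layout information or other heuristics.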

“…Our approach builds upon a new document parsing system [17] that can automatically extract figure-caption pairs from PDF articles. This allows us to efficiently recover figures and then associate them with the corresponding captions.…”

Section: Document Processing (mentioning)

confidence: 99%

An Image-Text Approach for Extracting Experimental Evidence of Protein-Protein Interactions in the Biomedical Literature

Lopez

Arighi

et al. 2013

Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics

Self Cite

Proteins are complex biological polymers that mediate virtually all cellular functions. Typically these functions are modulated by protein-protein interactions (PPI). Tremendous efforts have been made by life scientists to detect PPIs through different experimental approaches and document the results through publications. On the informatics front, however, there lacks an effective means for retrieving PPI information from published literatures. In this work we present a novel framework for identifying experimental methods employed for analyzing PPI from biomedical articles. Different from state-of-the-art approaches based only on text, we explore using the combination of attributes from figures, figure captions, and text within figures for identifying PPI experimental methods. Our work is motivated by the observation that biomedical figures often constitute direct evidence of experimental results and therefore provide complementary information to texts. We start with automatically extracting unimodal panels (subfigures) and their associated subcaptions and then classifying the subfigure into different types using a proposed hierarchical image taxonomy. Next, we combine the subfigure types with text-based features to form a hybrid feature descriptor and use it for PPI method classification. We further construct a dataset starting from a set of 2,256 documents provided by the molecular interaction database MINT. Here we show that our new approach outperforms the text-only solution for associating figures with PPI methods.

“…Specific approaches aiming to extract figures and captions from PDF documents have been recently proposed. Lopez et al [8] and Choudhury et al [9] introduced methods based on available tools (Xpdf and PDFBox respectively), but neither method handles vector graphics within scientific publications.…”

Section: Introduction (mentioning)

confidence: 99%

Extracting Figures and Captions from Scientific Publications

Jiang

Shatkay

2018

Proceedings of the 27th ACM International Conference on Information and Knowledge Management

Self Cite

Figures and captions convey essential information in scientific publications. As such, there is a growing interest in mining published figures and in utilizing their respective captions as a source of knowledge. There is also much interest in image captioning systems that can automatically generate captions for images, whose training requires large datasets of image-caption pairs. Notably, the first fundamental step of obtaining figures and captions from publications is neither well-studied nor yet well-addressed. In this paper, we introduce a new and effective system for figure and caption extraction, PDFigCapX. Unlike current methods that extract figures by handling raw encoded contents of PDF documents, we separate text from graphical contents and utilize layout information to detect and disambiguate figures and captions. Files containing the figures and their associated captions are then produced as output to the end user. We test PDFigCapX on both a previously used generic dataset and on two new sets of publications within the biomedical domain. Our experiments and results show a significant improvement in performance compared to the state-of-the-art, and demonstrate the effectiveness of our approach. Our system will be available for use at: https://www.eecis.udel.edu/~compbio/PDFigCapX.
