2022
DOI: 10.1007/978-3-031-16802-4_5
Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Abstract: Scientific articles published prior to the "age of digitization" in the late 1990s contain figures which are "trapped" within their scanned pages. While progress to extract figures and their captions has been made, there is currently no robust method for this process. We present a YOLO-based method for use on scanned pages, post-Optical Character Recognition (OCR), which uses both grayscale and OCR-features. When applied to the astrophysics literature holdings of the Astrophysics Data System (ADS), we find F1 …
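As a rough illustration of the kind of input the abstract describes (a detector fed both grayscale pixels and OCR-derived features), the sketch below stacks a grayscale page scan with a word-level OCR mask as two input channels. This is not the authors' released pipeline: build_page_tensor, the two-channel layout, and the use of pytesseract/Tesseract are assumptions made only for illustration.

import numpy as np
from PIL import Image
import pytesseract  # assumes a local Tesseract installation

def build_page_tensor(page_path):
    # Hypothetical helper: build a (2, H, W) array from one scanned page.
    page = Image.open(page_path)

    # Channel 1: the raw grayscale scan, normalized to [0, 1].
    gray = np.asarray(page.convert("L"), dtype=np.float32) / 255.0

    # Channel 2: a binary mask over word bounding boxes returned by OCR.
    ocr = pytesseract.image_to_data(page, output_type=pytesseract.Output.DICT)
    word_mask = np.zeros_like(gray)
    for left, top, w, h, conf in zip(ocr["left"], ocr["top"],
                                     ocr["width"], ocr["height"], ocr["conf"]):
        if float(conf) > 0:  # keep only boxes Tesseract assigns a confidence to
            word_mask[top:top + h, left:left + w] = 1.0

    # A detector trained on this stacked input can use both pixel content
    # and text layout when localizing figures and their captions.
    return np.stack([gray, word_mask], axis=0)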

Cited by 5 publications (13 citation statements)
References: 50 publications
“…However, as shown by the solid and dashed orange lines of Figure 1, this is not the case. While the F1 score for the detectron2 model is higher than ours for figures, it does less well for figure captions (though, as discussed in [12], this may be due to inconsistencies between the "caption" and "text" classes, the latter being the only class […]) [12,14] (blue lines) and detectron2 [16] trained on the PubLayNet document dataset [17] (orange lines). Neither model generalizes well to the HathiTrust set of documents.…”
Section: The Problem of Generalizability (mentioning)
confidence: 56%
“…Our prior work was aimed at the extraction of figures and their captions from a subset of the "pre-digital" astrophysical literature holdings of the Astrophysics Data System (ADS) using both grayscale and optical character recognition (OCR) features of article pages [12]. Our model produced a high level of accuracy on our dataset: for an intersection-over-union (IOU) metric of 0.9 we found F1 scores of ≥ 90% [12,14].…”
Section: The Problem of Generalizability (mentioning)
confidence: 96%
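For context on the two metrics quoted in this statement, the snippet below is a minimal, self-contained sketch of intersection-over-union (IOU) between two boxes and an F1 score at a fixed IOU threshold. The greedy one-to-one matching and the (x_min, y_min, x_max, y_max) box format are assumptions for illustration, not the evaluation code used in [12,14].

def iou(box_a, box_b):
    # Boxes are (x_min, y_min, x_max, y_max).
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def f1_at_iou(predictions, ground_truths, threshold=0.9):
    # A prediction counts as a true positive if it overlaps an
    # as-yet-unmatched ground-truth box at or above the IOU threshold.
    matched = set()
    tp = 0
    for pred in predictions:
        for i, gt in enumerate(ground_truths):
            if i not in matched and iou(pred, gt) >= threshold:
                matched.add(i)
                tp += 1
                break
    fp = len(predictions) - tp
    fn = len(ground_truths) - tp
    precision = tp / (tp + fp) if predictions else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall > 0 else 0.0)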