Figure and Figure Caption Extraction for Mixed Raster and Vector PDFs: Digitization of Astronomical Literature with OCR Features

Naiman, Jill; Williams, Peter K. G.; Goodman, Alyssa A.

doi:10.1007/978-3-031-16802-4_5

Cited by 5 publications

(13 citation statements)

References 50 publications


“…However, as shown by the solid and dashed orange lines of Figure 1, this is not the case. While the F1 score for the detectron2 model is higher than ours for figures, it does less well for figure captions (though, as discussed in [12] this may be due to inconsistencies between "caption" and "text" classes, the latter being the only class … [12,14] (blue lines) and detectron2 [16] trained on the PubLayNet document dataset [17] (orange lines). Neither model generalizes well to the HathiTrust set of documents.…”

Section: The Problem Of Generalizability (mentioning)

confidence: 56%

“…Our prior work was aimed at the extraction of figures and their captions from a subset of the "pre-digital" astrophysical literature holdings of the Astrophysics Data System (ADS) using both grayscale and optical character recognition (OCR) features of article pages [12]. Our model produced a high level of accuracy on our dataset: for an intersection-over-union (IOU) metric of 0.9 we found F1 scores of ≥ 90% [12,14].…”

Section: The Problem Of Generalizability (mentioning)

confidence: 96%

“…A change in not only publication type, but simply publication year can drastically lower the accuracy of page object extraction methods for models that are not explicitly trained on this type of document [10,12,14]. Our prior work was aimed at the extraction of figures and their captions from a subset of the "pre-digital" astrophysical literature holdings of the Astrophysics Data System (ADS) using both grayscale and optical character recognition (OCR) features of article pages [12]. Our model produced a high level of accuracy on our dataset: for an intersection-over-union (IOU) metric of 0.9 we found F1 scores of ≥ 90% [12,14].…”

Section: The Problem Of Generalizability (mentioning)

confidence: 99%

“…To give our model the best chance of success in our comparison, we subset this collection with the search fields of "astronomy" and filter the results to include only English-language Conference, Journal and Manuscript documents with "Full View" available, bringing the total records to 56,282. For our illustration, we select ≈350 randomly chosen pages from six articles within this filtered collection which we annotate with figure and caption class definitions from [12,14] in Make-Sense.ai [15]. The precision, recall and F1 score at different IOU cut-offs for our model applied to this data are shown by solid and dashed navy blue lines in Figure 1.…”

Section: The Problem Of Generalizability (mentioning)

confidence: 99%
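
The citation statements above report precision, recall and F1 at different IOU cut-offs. As a rough illustration of how such scores can be computed for bounding boxes, here is a minimal Python sketch. It is not the authors' evaluation code: the `(x_min, y_min, x_max, y_max)` box layout, the function names, and the greedy one-to-one matching rule are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed box layout for this sketch: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(
    pred: List[Box], truth: List[Box], cutoff: float
) -> Tuple[float, float, float]:
    """Score predictions against ground truth at one IOU cut-off.

    Hypothetical matching rule: each predicted box is greedily paired
    with the best-overlapping unmatched ground-truth box and counts as
    a true positive when their IOU is at least `cutoff`.
    """
    unmatched = list(truth)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda t: iou(p, t), default=None)
        if best is not None and iou(p, best) >= cutoff:
            tp += 1
            unmatched.remove(best)
    fp = len(pred) - tp   # predictions with no matching truth box
    fn = len(truth) - tp  # truth boxes that were never matched
    precision = tp / (tp + fp) if pred else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    truth = [(0, 0, 10, 10), (20, 20, 30, 30)]
    pred = [(0, 0, 10, 9), (50, 50, 60, 60)]
    for cutoff in (0.8, 0.95):
        print(cutoff, precision_recall_f1(pred, truth, cutoff))
```

Sweeping `cutoff` over a range of values yields the kind of score-versus-IOU curves the quoted passage describes; stricter cut-offs can only keep or lower the number of matches under this matching rule.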

“…In the cases of born-digital documents, these deep learning methods are combined with heuristically-derived results in post-processing steps [9]. However, for historical scanned documents, these methods present many challenges [10][11][12].…”

Section: Introduction (mentioning)

confidence: 99%


Generalizability in Document Layout Analysis for Scientific Article Figure & Caption Extraction

Naiman¹

2023

Preprint


The lack of generalizability, in which a model trained on one dataset cannot provide accurate results for a different dataset, is a known problem in the field of document layout analysis. Thus, when a model is used to locate important page objects in scientific literature such as figures, tables, captions, and math formulas, the model often cannot be applied successfully to new domains. While several solutions have been proposed, including newer and updated deep learning models, larger hand-annotated datasets, and the generation of large synthetic datasets, so far there is no "magic bullet" for translating a model trained on a particular domain or historical time period to a new field. Here we present our ongoing work in translating our document layout analysis model from the historical astrophysical literature to the larger corpus of scientific documents within the HathiTrust U.S. Federal Documents collection. We use this example as an avenue to highlight some of the problems with generalizability in the document layout analysis community and discuss several challenges and possible solutions to address these issues. All code for this work is available on The Reading Time Machine GitHub repository, https://github.com/ReadingTimeMachine/htrc short conf.
