“…Through ablation experiments we find that the combination of page and hOCR properties (grayscale, ascenders, descenders, word confidences, fraction of numbers in a word, fraction of letters in a word, punctuation, word rotation, and spaCy POS) maximizes our model's performance. Compared to other deep learning models popular for document layout analysis (ScanBank [21,47] and detectron2 [45]), our model performs better on our dataset, particularly at high IOU thresholds (IOU=0.9) and especially for figure captions. In line with our extraction goals, our model also has relatively low false positive rates, minimizing the extraction of erroneous page objects.…”
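
To make the IOU threshold concrete, the following is a minimal sketch of how a predicted bounding box might be scored against a ground-truth box and counted as a true or false positive at IOU=0.9. It is not the evaluation code behind the results quoted above: the `(x_min, y_min, x_max, y_max)` box format, the function names `iou` and `count_matches`, and the greedy one-to-one matching are all assumptions made for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes.

    Boxes are (x_min, y_min, x_max, y_max); this format is an assumption.
    """
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap, clamped to zero for disjoint boxes.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def count_matches(predicted, truth, threshold=0.9):
    """Greedily match predictions to ground truth at an IOU threshold.

    Returns (true_positives, false_positives, false_negatives).
    Greedy one-to-one matching is an illustrative choice, not
    necessarily the protocol used in the quoted work.
    """
    unmatched = list(truth)
    tp = fp = 0
    for pred in predicted:
        # Best remaining ground-truth box for this prediction, if any.
        best = max(unmatched, key=lambda t: iou(pred, t), default=None)
        if best is not None and iou(pred, best) >= threshold:
            tp += 1
            unmatched.remove(best)
        else:
            fp += 1
    return tp, fp, len(unmatched)
```

Under this sketch, a prediction that overlaps a caption box with an IOU of 0.95 counts as a true positive at the 0.9 threshold, while any prediction with no sufficiently overlapping ground-truth box counts as a false positive, which is the quantity the excerpt says the model keeps low.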