“…Information extraction approaches must handle varying layouts, semantic fields and multiple input modalities at the intersection of computer vision, natural language processing and information retrieval. While there has been progress on the task [4,7,14,15,18,19,25,34], there is no publicly available large-scale benchmark to train and compare these approaches, an issue that has been noted by several authors [5,16,24,26,29]. Existing approaches are trained on privately collected datasets, hindering their reproducibility, fair comparisons and tracking field progression [11,23,24].…”