2022
DOI: 10.1109/access.2022.3202639

FormulaNet: A Benchmark Dataset for Mathematical Formula Detection

Abstract: One unsolved sub-task of document analysis is mathematical formula detection (MFD). Research by ourselves and others has shown that existing MFD datasets with inline and display formula labels are small and have insufficient labeling quality. There is therefore an urgent need for datasets with better quality labeling for future research in the MFD field, as they have a high impact on the performance of the models trained on them. We present an advanced labeling pipeline and a new dataset called FormulaNet in th…

Cited by 9 publications (6 citation statements)
References 20 publications

“…For our upcoming research steps, we plan to combine FormulaNet [32] and MathNet to develop a semi-automatic captioning system for MEs in PDFs. With this system, we expect to significantly improve the accessibility of PDFs specifically for MEs and also enable easy searching and extracting of MEs from PDFs.…”
Section: Discussion
Citation type: mentioning
confidence: 99%
“…We will demonstrate the influence of the resolution on the model performance in Section V-A. The resulting im2latexv2 dataset contains fewer MEs than the original im2latex-100k due to our rendering pipeline, which includes four check criteria (see Algorithm By using the Mathematical Formula Detection model from Schmitt-Koopmann et al [32], we collected over 250k ME from randomly selected arXiv papers with 600 DPI and selected 200 MEs at random for manual annotation. As shown in Table 2 we deleted 69 MEs where the image was larger than 768x2400 pixels.…”
Section: A Im2latexv2
Citation type: mentioning
confidence: 99%
“…DocBank (Li et al, 2020b) extended from TableBank, provides token-level fine-grained categories labeling. FormulaNet (Schmitt-Koopmann et al, 2022) and IBEM (Anitei et al, 2023) focus on mathematical formulas, especially in-line formulas, which can easily be confused with plain-texts. SciBank (Grijalva et al, 2022) produces block-level annotations.…”
Section: Document Datasets
Citation type: mentioning
confidence: 99%
“…In this context, LaTeX code has emerged as a valuable resource. Many of the weakly supervised annotated document IE datasets have their genesis in LaTeX code (Li et al, 2020b; Schmitt-Koopmann et al, 2022; Anitei et al, 2023).…”
Section: Introduction
Citation type: mentioning
confidence: 99%
“…PDF creators can include or add these accessibility features manually when making their documents, or can choose from a number of tools which automate or partially automate this process (Darvishy, 2018; Darvishy et al, 2012; Darvishy & Hutter, 2013; Doblies et al, 2014). Some new research is also investigating the potential of artificial intelligence to automate document accessibility (Darvishy et al, 2016; Schmitt‐Koopmann, Huang, & Darvishy, 2022; Schmitt‐Koopmann, Huang, Hutter, et al, 2022). For PDFs created by scanning physical pages, OCR (Optical Character Recognition) technology can convert them into a standard machine‐readable PDF format.…”
Section: Introduction
Citation type: mentioning
confidence: 99%