Table understanding is a well studied problem in document analysis, and many academic and commercial approaches have been developed to recognize tables in several document formats, including plain text, scanned page images and born-digital, object-based formats such as PDF. Despite the abundance of these techniques, an objective comparison of their performance is still missing. The Table Competition held in the context of ICDAR 2013 is our first attempt at objectively evaluating these techniques against each other in a standardized way, across several input formats. The competition independently addresses three problems: (i) table location, (ii) table structure recognition, and (iii) these two tasks combined. We received results from seven academic systems, which we have also compared against four commercial products. This paper presents our findings.
The PDF format is commonly used for the exchange of documents on the Web and there is a growing need to understand and extract or repurpose data held in PDF documents. Many systems for processing PDF files use algorithms designed for scanned documents, which analyse a page based on its bitmap representation. We believe this approach to be inefficient. Not only does the rasterization step cost processing time, but information is also lost and errors can be introduced.Inspired primarily by the need to facilitate machine extraction of data from PDF documents, we have developed methods to extract textual and graphic content directly from the PDF content stream and represent it as a list of "objects" at a level of granularity suitable for structural understanding of the document. These objects are then grouped into lines, paragraphs and higher-level logical structures using a novel bottom-up segmentation algorithm based on visual perception principles. Experimental results demonstrate the viability of our approach, which is currently used as a basis for HTML conversion and data extraction methods.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.