This paper proposes a new method of table structure analysis, based on cell classification and cell modification, as the basis of an OCR system that converts a variety of printed tables into XML documents conforming to a specified XML schema. The method proceeds as follows. First, cell features defined by ruled lines, which correspond to data fields, are extracted from the input image of a table. Each cell is then classified to identify irregular tables, whose ruled lines do not form a grid, and the cells of such tables are modified to form a regular cell arrangement. Next, the hierarchical table structure, consisting of a regular row structure of cells, is extracted from the modified regular table and described as a DOM tree. At this stage, logical objects within a cell are extracted and converted into sub-trees of the DOM tree. Finally, the DOM tree is transformed into the target XML document by an XML parser with an information extraction process. Experimental results show that the method is effective in transforming various printed tables into various XML documents.
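The last stages of the pipeline (regularizing the cell arrangement, then describing the row structure as a DOM tree) can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not specify how cells are classified or modified, so the sketch assumes that an irregular cell is simply one spanning several rows or columns and that it is regularized by copying its text into every grid position it covers. The `Cell` type, the functions `regularize` and `grid_to_dom`, and the element names `table`, `row` and `cell` are all hypothetical; only Python's standard `xml.dom.minidom` module is used.

```python
from dataclasses import dataclass
from xml.dom.minidom import Document


@dataclass
class Cell:
    """Hypothetical cell record: grid position, recognized text, and spans."""
    row: int
    col: int
    text: str
    rowspan: int = 1
    colspan: int = 1


def regularize(cells):
    """Assumed modification step: split each spanning cell into unit cells.

    Returns a dict mapping (row, col) to text, i.e. a regular grid.
    """
    grid = {}
    for c in cells:
        for r in range(c.row, c.row + c.rowspan):
            for k in range(c.col, c.col + c.colspan):
                grid[(r, k)] = c.text
    return grid


def grid_to_dom(grid):
    """Describe the regular row structure as a DOM tree (table > row > cell)."""
    doc = Document()
    table = doc.createElement("table")
    doc.appendChild(table)
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(k for _, k in grid) + 1
    for r in range(n_rows):
        row_el = doc.createElement("row")
        table.appendChild(row_el)
        for k in range(n_cols):
            cell_el = doc.createElement("cell")
            # Positions not covered by any cell become empty cells.
            cell_el.appendChild(doc.createTextNode(grid.get((r, k), "")))
            row_el.appendChild(cell_el)
    return doc


if __name__ == "__main__":
    # A header cell spanning two columns above two ordinary cells.
    cells = [
        Cell(0, 0, "Name", colspan=2),
        Cell(1, 0, "First"),
        Cell(1, 1, "Last"),
    ]
    print(grid_to_dom(regularize(cells)).documentElement.toxml())
```

Mapping the resulting tree onto a specific target schema, which the paper handles with an information extraction process, is outside the scope of this sketch.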