Rule Driven Spreadsheet Data Extraction from Statistical Tables: Case Study

Paramonov, Viacheslav; Shigarov, Alexey O.; Vetrova, Varvara

doi:10.1007/978-3-030-88304-1_7

Cited by 3 publications

(2 citation statements)

References 17 publications

“…Table locations in sheets are unknown in general. Moreover, the physical structure of hand-coded tables is often inappropriate for automatic processing (Paramonov et al., 2020, 2021). For example, the visual formatting (cell borders and text arrangement) allows presentation of two or more adjacent machine-readable cells as one human-readable cell, and vice versa, one machine-readable cell can be actually read by humans as several cells.…”

Section: Tabular Data Sources (mentioning)

confidence: 99%

Table understanding: Problem overview

Shigarov

2022

WIREs Data Min & Knowl

Self Cite

Tables are probably the most natural way to represent relational data in various media and formats. They store a large number of valuable facts that could be utilized for question answering, knowledge base population, natural language generation, and other applications. However, many tables are not accompanied by semantics for the automatic interpretation of the information they present. Table Understanding (TU) aims at recovering the missing semantics that enables the extraction of facts from tables. This problem covers a range of issues from table detection in document images to semantic table interpretation with the help of external knowledge bases. To date, TU research has been ongoing for 30 years. Nevertheless, there is no common point of view on the scope of TU; the terminology still needs agreement and unification. In recent years, science and technology have shown a rapidly increasing interest in TU. Nowadays, it is especially important to check the meaning of this research problem once again. This article gives a comprehensive characterization of the TU problem, including a description of its subproblems, tasks, subtasks, and applications. It also discusses the common limitations used in the existing problem statements and proposes some directions for further research that would help overcome the corresponding limitations. This article is categorized under: Algorithmic Development > Text Mining; Algorithmic Development > Web Mining.

“…In the STE scenario, TSR should be postulated as the correction of a sheet grid corresponding to the visual representation of a table. The main goal is to make machine-readable cells identical to human-readable cells (Paramonov et al., 2020, 2021).…”

Section: Table Structure Recognition (mentioning)

confidence: 99%
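The grid-correction idea in the two citation statements above can be illustrated with a small sketch. It is not the method of Paramonov et al. or of the citing article; it only assumes, for illustration, that each grid cell reports whether it has a visible bottom or right border, and it joins neighbouring cells that have no border between them into one logical cell. The function name `merge_by_borders` and the border sets are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, column) position in the sheet grid


def merge_by_borders(
    n_rows: int,
    n_cols: int,
    bottom_border: Set[Cell],
    right_border: Set[Cell],
) -> List[List[Cell]]:
    """Group grid cells into logical cells.

    Assumption (hypothetical input model): a cell listed in `bottom_border`
    has a visible line under it, and a cell listed in `right_border` has a
    visible line to its right. Neighbours with no line between them are
    treated as parts of the same human-readable cell.
    """
    # Union-find over all grid positions.
    parent: Dict[Cell, Cell] = {
        (r, c): (r, c) for r in range(n_rows) for c in range(n_cols)
    }

    def find(x: Cell) -> Cell:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a: Cell, b: Cell) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the top-left-most position as the representative.
            parent[max(ra, rb)] = min(ra, rb)

    for r in range(n_rows):
        for c in range(n_cols):
            if r + 1 < n_rows and (r, c) not in bottom_border:
                union((r, c), (r + 1, c))
            if c + 1 < n_cols and (r, c) not in right_border:
                union((r, c), (r, c + 1))

    groups: Dict[Cell, List[Cell]] = {}
    for cell in sorted(parent):
        groups.setdefault(find(cell), []).append(cell)
    return [groups[key] for key in sorted(groups)]


# A 2x2 grid in which the left column is visually one tall cell.
logical_cells = merge_by_borders(
    n_rows=2,
    n_cols=2,
    bottom_border={(0, 1)},
    right_border={(0, 0), (1, 0)},
)
print(logical_cells)
```

A grouping like this can produce non-rectangular regions when borders are drawn inconsistently, so a practical tool would likely need extra checks (for example on text arrangement, which the quoted statement also mentions).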

Table understanding: Problem overview

Shigarov

2022

WIREs Data Min & Knowl

Self Cite

HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets

Chen et al.

2022

2022 IEEE International Conference on Knowledge Graph (ICKG)

HUSS: A Heuristic Method for Understanding the Semantic Structure of Spreadsheets

Chen et al.

2023

Data Intelligence

Spreadsheets contain a lot of valuable data and have many practical applications. The key technology of these practical applications is how to make machines understand the semantic structure of spreadsheets, e.g., identifying cell function types and discovering relationships between cell pairs. Most existing methods for understanding the semantic structure of spreadsheets do not make use of the semantic information of cells. A few studies do, but they ignore the layout structure information of spreadsheets, which affects the performance of cell function classification and the discovery of different relationship types of cell pairs. In this paper, we propose a Heuristic algorithm for Understanding the Semantic Structure of spreadsheets (HUSS). Specifically, for improving the cell function classification, we propose an error correction mechanism (ECM) based on an existing cell function classification model [11] and the layout features of spreadsheets. For improving the table structure analysis, we propose five types of heuristic rules to extract four different types of cell pairs, based on the cell style and spatial location information. Our experimental results on five real-world datasets demonstrate that HUSS can effectively understand the semantic structure of spreadsheets and outperforms corresponding baselines.
