Spreadsheet Property Detection With Rule-assisted Active Learning

Chen, Zhe; Dadiomov, Sasha; Wesley, Richard; Xiao, Gang; Cory, Daniel; Cafarella, Michael; Mackinlay, Jock D.

doi:10.1145/3132847.3132882

Cited by 25 publications

(24 citation statements)

References 23 publications

“…There is a considerable number of works tackling layout inference and information extraction in spreadsheets. Recent publications propose approaches involving to some extent machine learning techniques, such as [2], [3], [4], [5], and [6]. Also, we find rule-based approaches, like [7].…”

Section: Related Work (mentioning)

confidence: 80%

Table Recognition in Spreadsheets via a Graph Representation

Koci, Thiele, Lehner, et al. 2018

2018 13th IAPR International Workshop on Document Analysis Systems (DAS)

Spreadsheet software are very popular data management tools. Their ease of use and abundant functionalities equip novices and professionals alike with the means to generate, transform, analyze, and visualize data. As a result, spreadsheets are a great resource of factual and structured information. This accentuates the need to automatically understand and extract their contents. In this paper, we present a novel approach for recognizing tables in spreadsheets. Having inferred the layout role of the individual cells, we build layout regions. We encode the spatial interrelations between these regions using a graph representation. Based on this, we propose Remove and Conquer (RAC), an algorithm for table recognition that implements a list of carefully curated rules. An extensive experimental evaluation shows that our approach is viable. We achieve significant accuracy in a dataset of real spreadsheets from various domains.

“…While there is some support to perform spreadsheet data extraction, like [1] and [2], it can not be considered a general purpose solution for arbitrary inputs. Previous work often assumes just one table per sheet.…”

Section: Introduction (mentioning)

confidence: 99%

Table Recognition in Spreadsheets via a Graph Representation

Koci, Thiele, Lehner, et al. 2018

2018 13th IAPR International Workshop on Document Analysis Systems (DAS)

“…VizNet currently centralizes four corpora of data from the web, open data portals, and online visualization galleries. We plan to expand the VizNet corpus with the 410,554 Microsoft Excel workbook files (1,181,530 sheets) [8] extracted from the ClueWeb09 web crawl 1 . Furthermore, Morton et.…”

Section: Discussion (mentioning)

confidence: 99%

VizNet

Gaikwad, Hulsebos, et al. 2019

Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems

and developing benchmark models and algorithms for automating visual analysis. To demonstrate VizNet's utility as a platform for conducting online crowdsourced experiments at scale, we replicate a prior study assessing the influence of user task and data distribution on visual encoding effectiveness, and extend it by considering an additional task: outlier detection. To contend with running such studies at scale, we demonstrate how a metric of perceptual effectiveness can be learned from experimental results, and show its predictive power across test datasets. CCS CONCEPTS • Human-centered computing → Visualization design and evaluation methods; Visualization theory, concepts and paradigms; • Computing methodologies → Machine learning.

“…Recent attempts, such as Ideas in Excel and Explore in Google Sheets, aim at providing insights and recommendations to users (e.g., summary statistics and charts), based on background analysis of tabular data in the sheet. Other works [1,3,5,17], including ours [10][11][12][13][14][15], focus on integrating and extracting data from spreadsheets. One of the main concerns comes with data and knowledge being scattered in multiple spreadsheet files.…”

Section: Introduction (mentioning)

confidence: 99%

“…https://ironpython.net/ 4 http://officeopenxml.com/anatomyofOOXML-xlsx.php 5 https://openpyxl.readthedocs.io/en/stable/…”

mentioning

confidence: 99%

XLIndy

Koci, Kuban, Luettig, et al. 2019

Proceedings of the ACM Symposium on Document Engineering 2019

Over the years, spreadsheets have established their presence in many domains, including business, government, and science. However, challenges arise due to spreadsheets being partially-structured and carrying implicit (visual and textual) information. This translates into a bottleneck, when it comes to automatic analysis and extraction of information. Therefore, we present XLIndy, a Microsoft Excel add-in with a machine learning back-end, written in Python. It showcases our novel methods for layout inference and table recognition in spreadsheets. For a selected task and method, users can visually inspect the results, change configurations, and compare different runs. This enables iterative fine-tuning. Additionally, users can manually revise the predicted layout and tables, and subsequently save them as annotations. The latter is used to measure performance and (re-)train classifiers. Finally, data in the recognized tables can be extracted for further processing. XLIndy supports several standard formats, such as CSV and JSON.
