Wrangling messy CSV files by detecting row and type patterns

Burg, Gerrit J. J. van den; Nazabal, Alfredo; Sutton, Charles

doi:10.1007/s10618-019-00646-y

Cited by 23 publications

(17 citation statements)

References 19 publications


“…Given the proximity of such users to actual databases, GitHub is a rich source for heterogeneous tables. Prior analyses of CSV files from GitHub also found that these files have diverse formatting and the tables extracted from them have relatively large dimensions [35,19]. These properties are common across database contexts [35,20], so that we consider CSV files from GitHub a suitable resource for database-like tables (C2).…”

Section: Design Principles of GitTables (mentioning)

confidence: 99%

GitTables: A Large-Scale Corpus of Relational Tables

Hulsebos

Demiralp

Groth

2021

Preprint


The practical success of deep learning has sparked interest in improving relational table tasks, like data search, with models trained on large table corpora. Existing corpora primarily contain tables extracted from HTML pages, limiting the capability to represent offline database tables. To train and evaluate high-capacity models for applications beyond the Web, we need additional resources with tables that resemble relational database tables. Here we introduce GitTables, a corpus of currently 1.7M relational tables extracted from GitHub. Our continuing curation aims at growing the corpus to at least 20M tables. We annotate table columns in GitTables with more than 2K different semantic types from Schema.org and DBpedia. Our column annotations consist of semantic types, hierarchical relations, range types and descriptions. The corpus is available at https://gittables.github.io. Our analysis of GitTables shows that its structure, content, and topical coverage differ significantly from existing table corpora. We evaluate our annotation pipeline on hand-labeled tables from the T2Dv2 benchmark and find that our approach provides results on par with human annotations. We demonstrate a use case of GitTables by training a semantic type detection model on it and obtain high prediction accuracy. We also show that the same model trained on tables from the Web generalizes poorly. Preprint. Under review.



“…While parsing CSV files in the standard format [29] is easy, parsing a file with non-standard column separators and other formatting parameters often requires human insight. CleverCSV [5] is an automatic tool that uses a data consistency measure to determine formatting parameters, called a "dialect", consisting of the delimiter (e.g., ,), quote (e.g., ") and escape characters (e.g., \). We adapt CleverCSV into an interactive AI assistant that allows the analyst to guide the tool in case the automatic detection fails.…”

Section: CleverCSV: Parsing Tabular Data Files (mentioning)

confidence: 99%
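
To make the notion of a "dialect" in the statement above concrete, here is a minimal sketch using only Python's standard csv module, not CleverCSV itself. The sample text and the helper name parse_with_dialect are hypothetical and chosen for illustration; the sketch only shows how the three parameters (delimiter, quote character, escape character) change the parsed table.

```python
import csv
import io

# Hypothetical messy input: semicolon delimiter, single-quote quoting,
# and a backslash escaping a quote character inside a quoted cell.
raw = "id;name;note\n1;'Smith; John';'it\\'s fine'\n2;'Doe; Jane';'ok'\n"


def parse_with_dialect(text, delimiter, quotechar, escapechar=None):
    """Parse text with an explicit dialect and return the rows as lists."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
        doublequote=False,  # quotes are escaped, not doubled, in this sample
    )
    return list(reader)


# The dialect that matches the sample yields a regular three-column table.
good_rows = parse_with_dialect(raw, delimiter=";", quotechar="'", escapechar="\\")

# The default comma/double-quote dialect leaves every line as a single cell.
bad_rows = parse_with_dialect(raw, delimiter=",", quotechar='"')
```

With the matching dialect, the second row parses into the cells 1, Smith; John, and it's fine; with the comma dialect, no line is split at all.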

“…Objective function optimization. The objective function Q_H for the AI assistant does not depend on user interactions and uses the consistency measure of non-interactive CleverCSV [5]. The measure is calculated by parsing the input file using a potential dialect and taking the product of two scores: the "pattern score" that captures how regular the structure of the parsed data is (i.e., does the resulting table have the same number of cells in each row?…”

Section: CleverCSV: Parsing Tabular Data Files (mentioning)

confidence: 99%
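
The question quoted above ("does the resulting table have the same number of cells in each row?") can be illustrated with a deliberately simplified stand-in. The function row_length_regularity below is hypothetical: it is not CleverCSV's pattern score and it ignores the second score of the product entirely. It only reports the fraction of rows that share the most common row length under a candidate delimiter.

```python
import csv
import io
from collections import Counter


def row_length_regularity(text, delimiter):
    """Fraction of non-empty rows whose cell count equals the most common cell count.

    A simplified illustration only; this is an assumption for the example,
    not the measure defined in the CleverCSV paper.
    """
    rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
    if not rows:
        return 0.0
    length_counts = Counter(len(r) for r in rows)
    most_common_count = length_counts.most_common(1)[0][1]
    return most_common_count / len(rows)


# Hypothetical sample: semicolon-delimited, with decimal commas in some cells.
sample = "a;b;c\n1,5;2;3\n4;5,5;6\n7;8;9\n"

scores = {d: row_length_regularity(sample, d) for d in (";", ",")}
```

On this sample the semicolon gives every row three cells, while the comma splits only the two rows that contain a decimal comma. Note that this simplified ratio also rates a single-column parse as perfectly regular, so it is not usable as a detector on its own.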

“…Example of using CleverCSV. While the automatic dialect detection proposed in [5] achieves 97% accuracy, one type of failure arises when there are two delimiters that result in consistent row lengths and interpretable cells:…”

Section: CleverCSV: Parsing Tabular Data Files (mentioning)

confidence: 99%
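
The failure case described above, two delimiters that both produce consistent row lengths, can be reproduced on a small hypothetical sample. The helper consistent_delimiters is an illustrative assumption, not part of CleverCSV; it checks row lengths only and makes no attempt to judge whether the resulting cells are interpretable.

```python
import csv
import io


def consistent_delimiters(text, candidates):
    """Return the candidate delimiters that give every row the same cell count (> 1)."""
    result = []
    for delimiter in candidates:
        rows = [r for r in csv.reader(io.StringIO(text), delimiter=delimiter) if r]
        lengths = {len(r) for r in rows}
        # Exactly one distinct length, and more than one cell per row.
        if len(lengths) == 1 and next(iter(lengths)) > 1:
            result.append(delimiter)
    return result


# Hypothetical sample: semicolon-separated values written with decimal commas.
ambiguous = "1,5;2,5\n3,0;4,1\n7,2;8,9\n"

candidates_found = consistent_delimiters(ambiguous, [",", ";", "\t"])
```

Both the comma (three cells per row) and the semicolon (two cells per row) pass the check here, which is the kind of ambiguity where guidance from an analyst can settle the choice.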

“…Jupyter and RStudio are popular environments used for programmatic data cleaning. They are used alongside libraries that implement specific functionality such as parsing CSV files or merging datasets [5], [6] and general data transformation functions provided, e.g., by Pandas [7] and Tidyverse [8].…”

Section: Introduction (mentioning)

confidence: 99%


AI Assistants: A Framework for Semi-Automated Data Wrangling

Petříček

Burg

Nazabal

et al. 2023

IEEE Trans. Knowl. Data Eng.


Table understanding approaches for extracting knowledge from heterogeneous tables

Bonfitto

Casiraghi

Mesiti

2021

WIREs Data Min & Knowl


Table understanding methods extract, transform, and interpret the information contained in tabular data embedded in documents and files of different formats. Such automatic understanding would make it possible to exploit tabular information for accurately answering queries, integrating heterogeneous repositories of information in a common knowledge base, or exchanging information among different sources. The purpose of this survey is to provide a comprehensive analysis of the research efforts so far devoted to the problem of table understanding and to describe systems that support the transformation of heterogeneous tables into meaningful information. This article is categorized under: Application Areas > Data Mining Software Tools; Technologies > Data Preprocessing; Technologies > Structure Discovery and Clustering.
