2019
DOI: 10.1007/s10618-019-00646-y

Wrangling messy CSV files by detecting row and type patterns

Abstract: It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of …

Cited by 23 publications (17 citation statements)
References 19 publications
“…Given the proximity of such users to actual databases, GitHub is a rich source for heterogeneous tables. Prior analyses of CSV files from GitHub also found that these files have diverse formatting and the tables extracted from them have relatively large dimensions [35,19]. These properties are common across database contexts [35,20], so that we consider CSV files from GitHub a suitable resource for database-like tables (C2).…”
Section: Design Principles Of GitTables (mentioning)
confidence: 99%
“…While parsing CSV files in the standard format [29] is easy, parsing a file with non-standard column separators and other formatting parameters often requires human insight. CleverCSV [5] is an automatic tool that uses a data consistency measure to determine formatting parameters, called a "dialect", consisting of the delimiter (e.g., ,), quote (e.g., ") and escape characters (e.g., \). We adapt CleverCSV into an interactive AI assistant that allows the analyst to guide the tool in case the automatic detection fails.…”
Section: CleverCSV: Parsing Tabular Data Files (mentioning)
confidence: 99%
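To make the notion of a "dialect" in the statement above concrete, here is a minimal sketch using only Python's standard `csv` module, not CleverCSV itself. The sample text and the `parse` helper are hypothetical and exist only for illustration; the point is that the same bytes yield different tables depending on the delimiter, quote, and escape characters chosen.

```python
import csv
import io

# Hypothetical sample: semicolon-delimited, double-quote quoting,
# backslash used to escape quotes inside a quoted cell.
sample = 'name;note\n"Smith; J.";"said \\"hi\\""\n'


def parse(text, delimiter, quotechar, escapechar):
    """Parse text with an explicit dialect (delimiter, quote, escape)."""
    reader = csv.reader(
        io.StringIO(text),
        delimiter=delimiter,
        quotechar=quotechar,
        escapechar=escapechar,
        doublequote=False,  # quotes are escaped, not doubled, in this sample
    )
    return list(reader)


# Wrong dialect: the comma never occurs, so each row collapses into one cell.
wrong = parse(sample, ",", '"', "\\")

# Matching dialect: two cells per row, with the quoted semicolon preserved.
right = parse(sample, ";", '"', "\\")

print(wrong[0])
print(right)
```

With the matching dialect the second row keeps `Smith; J.` as a single cell, which is the kind of outcome an automatic detector has to infer without being told the parameters.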
“…Objective function optimization. The objective function Q_H for the AI assistant does not depend on user interactions and uses the consistency measure of non-interactive CleverCSV [5]. The measure is calculated by parsing the input file using a potential dialect and taking the product of two scores: the "pattern score" that captures how regular the structure of the parsed data is (i.e., does the resulting table have the same number of cells in each row?…”
Section: CleverCSV: Parsing Tabular Data Files (mentioning)
confidence: 99%
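The idea of scoring a candidate dialect by a product of two scores can be sketched as follows. This is a simplified illustration, not CleverCSV's actual formulas. The quoted passage is cut off before it describes the second score, so `type_score` below is a placeholder assumption, guided only by the paper title's mention of type patterns. All function names, the regular expression, and the candidate list are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumption: a cell "looks typed" if it is empty, a plain number,
# or a simple word. The real measure is not given in the quoted text.
KNOWN_TYPE = re.compile(r"\s*(-?\d+(\.\d+)?|[A-Za-z][A-Za-z _-]*)?\s*")


def pattern_score(rows):
    """Share of rows with the most common cell count, scaled so that
    single-column parses (no evidence for the delimiter) score zero."""
    if not rows:
        return 0.0
    modal_len, freq = Counter(len(r) for r in rows).most_common(1)[0]
    if modal_len == 0:
        return 0.0
    return (freq / len(rows)) * (modal_len - 1) / modal_len


def type_score(rows):
    """Placeholder second score: share of cells matching KNOWN_TYPE."""
    cells = [cell for row in rows for cell in row]
    if not cells:
        return 0.0
    return sum(1 for c in cells if KNOWN_TYPE.fullmatch(c)) / len(cells)


def consistency(text, delimiter, quotechar='"'):
    """Parse with one candidate dialect and multiply the two scores."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter,
                           quotechar=quotechar))
    return pattern_score(rows) * type_score(rows)


def best_delimiter(text, candidates=(",", ";", "\t", "|")):
    """Pick the candidate delimiter with the highest consistency."""
    return max(candidates, key=lambda d: consistency(text, d))


text = "id;name;price\n1;apple;0.5\n2;pear;0.75\n"
print(best_delimiter(text))
```

Using a product means a dialect must do well on both scores at once: a parse that yields perfectly regular rows of unrecognizable cells, or recognizable cells in ragged rows, is penalized either way.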