Gerrit J. J. van den Burg scite author profile

Gerrit J. J. van den Burg

4 Publications

20 Citation Statements Received

83 Citation Statements Given


Affiliations

The Alan Turing Institute, Amazon (United Kingdom)

Publications


Wrangling messy CSV files by detecting row and type patterns

Burg

Nazabal

Sutton

2019

Data Min Knowl Disc


It is well known that data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library.

"CSV is a textbook example of how not to design a textual file format." - The Art of Unix Programming, Raymond (2003).
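To make the notion of a CSV "dialect" concrete, here is a minimal sketch of dialect detection using `csv.Sniffer` from the Python standard library, which is the kind of existing approach the abstract compares against. It is not the method proposed in the paper. The sample data, the helper names `detect_dialect` and `read_rows`, and the candidate delimiter set are assumptions made for illustration.

```python
import csv
import io

# Hypothetical sample: a small semicolon-delimited table.
SAMPLE = (
    "id;item;qty\n"
    "1;bolt;40\n"
    "2;nut;55\n"
    "3;washer;120\n"
)


def detect_dialect(text, candidates=",;\t|"):
    """Guess the CSV dialect of `text` with the standard-library sniffer.

    `candidates` restricts which characters may be chosen as delimiter.
    """
    return csv.Sniffer().sniff(text, delimiters=candidates)


def read_rows(text):
    """Parse `text` into a list of rows using the detected dialect."""
    dialect = detect_dialect(text)
    return list(csv.reader(io.StringIO(text), dialect))


if __name__ == "__main__":
    dialect = detect_dialect(SAMPLE)
    print("delimiter:", repr(dialect.delimiter))
    print("quote character:", repr(dialect.quotechar))
    for row in read_rows(SAMPLE):
        print(row)
```

The sniffer is heuristic: on messy input it may pick the wrong delimiter or raise `csv.Error` when it cannot decide, so callers would typically wrap the call in error handling and inspect the result.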


AI Assistants: A Framework for Semi-Automated Data Wrangling

Petříček

Burg

Nazabal

et al. 2023

IEEE Trans. Knowl. Data Eng.


An Evaluation of Change Point Detection Algorithms

Burg

Williams

2020

Preprint


The Turing Change Point Dataset

Burg

Williams

2020

