Types from data: making structured data first-class citizens in F#

Petříček, Tomáš; Guerra, Gustavo; Syme, Don

doi:10.1145/2908080.2908115

Cited by 17 publications

(11 citation statements)

References 23 publications

“…On column type inference, we compare our method with F# (Petricek et al, 2016), hypoparsr (Döhmen et al, 2017a), messytables (Lindenberg, 2017), readr (Wickham et al, 2017), TDDA (Stochastic Solutions, 2018) and Trifacta (2018). Note that some of the related works are not directly applicable for this task, and these are not included in these experiments.…”

Section: Methods (mentioning)

confidence: 99%

“…However, this leads to a poor type detection performance since the Pandas reader is not robust against missing data and anomalies, where only empty string, NaN, and NULL are treated as missing data. Petricek et al (2016) propose another use of regular expressions with F#, where types, referred as shapes, are inferred w.r.t. a set of preferred shape relations.…”

Section: Related Work (mentioning)

confidence: 99%
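
The idea quoted above, inferring a column's type by preferring the most specific shape that fits every value, can be illustrated with a minimal sketch. This is not the algorithm of Petricek et al. (2016) or of ptype. The shape list, the regular expressions, the `MISSING` set and the function name `infer_column_shape` are all assumptions made for illustration. The missing-value markers mirror the three mentioned in the quote (empty string, NaN, NULL).

```python
import re

# Hypothetical shape hierarchy, ordered from most to least specific.
# Illustrative only; not the shape relation defined in the cited paper.
SHAPES = [
    ("bool", re.compile(r"^(true|false)$", re.IGNORECASE)),
    ("int", re.compile(r"^[+-]?\d+$")),
    ("float", re.compile(r"^[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?$")),
    ("string", re.compile(r"^.*$", re.DOTALL)),
]

# Assumed missing-value markers, following the ones named in the quote.
MISSING = {"", "NaN", "NULL"}


def infer_column_shape(values):
    """Return (shape_name, nullable) for a column of raw string cells."""
    stripped = [v.strip() for v in values]
    present = [v for v in stripped if v not in MISSING]
    nullable = len(present) < len(stripped)
    if not present:
        # Nothing but missing markers: no evidence for any concrete shape.
        return "null", True
    # Pick the first (most specific) shape that every present value matches.
    for name, pattern in SHAPES:
        if all(pattern.match(v) for v in present):
            return name, nullable
    return "string", nullable
```

A single unrecognised marker such as `"n/a"` is enough to push an otherwise numeric column down to `string`, which is the kind of fragility the citing paper points out for rule-based readers.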

“…Our focus here is on inferring the data type for each column in a table of data. Numerous studies have attempted to tackle type inference, including wrangling tools (Raman and Hellerstein, 2001;Kandel et al, 2011;Guo et al, 2011;Trifacta, 2018;Fisher and Gruber, 2005;Fisher et al, 2008), software packages (Petricek et al, 2016;Lindenberg, 2017;Stochastic Solutions, 2018;Döhmen et al, 2017a;Wickham et al, 2017), and probabilistic approaches (Valera and Ghahramani, 2017;Vergari et al, 2019;Limaye et al, 2010). However, often they do not work very well in the presence of missing and anomalous data, which are commonly found in raw data sets due to the lack of a well-organized data collection procedure.…”

Section: Introduction (mentioning)

confidence: 99%

ptype: probabilistic type inference

Ceritli

Williams

Geddes

2020

Data Min Knowl Disc

Type inference refers to the task of inferring the data type of a given column of data. Current approaches often fail when data contains missing data and anomalies, which are found commonly in real-world data sets. In this paper, we propose ptype, a probabilistic robust type inference method that allows us to detect such entries, and infer data types. We further show that the proposed method outperforms the existing methods.

“…However, The Gamma supports type providers [9,54], which can generate types based on an external file or a REST service call, e.g. [47]. For this reason, type checking can be relatively time consuming and can benefit from the same caching facilities as those available for instant previews.…”

Section: Type Checking (mentioning)

confidence: 99%

Foundations of a live data exploration environment

Petříček

2020

Programming

Self Cite

A growing amount of code is written to explore and analyze data, often by data analysts who do not have a traditional background in programming, for example by journalists. The way such data analysts write code is different from the way software engineers do so. They use few abstractions, work interactively and rely heavily on external libraries. We aim to capture this way of working and build a programming environment that makes data exploration easier by providing instant live feedback. We combine theoretical and applied approach. We present the data exploration calculus, a formal language that captures the structure of code written by data analysts. We then implement a data exploration environment that evaluates code instantly during editing and shows previews of the results. We formally describe an algorithm for providing instant previews for the data exploration calculus that allows the user to modify code in an unrestricted way in a text editor. Supporting interactive editing is tricky as any edit can change the structure of code and fully recomputing the output would be too expensive. We prove that our algorithm is correct and that it reuses previous results when updating previews after a number of common code edit operations. We also illustrate the practicality of our approach with an empirical evaluation and a case study. As data analysis becomes an ever more important use of programming, research on programming languages and tools needs to consider new kinds of programming workflows appropriate for those domains and conceive new kinds of tools that can support them. The present paper is one step in this important direction. ACM CCS 2012 Human-centered computing → Interactive systems and tools; Information systems → Data mining; Software and its engineering → Compilers; Foundations of a live data exploration environment development and execution.
Data analysts write small snippets of code, run them to see results immediately and then revise them. Notebooks are used by users ranging from scientists who implement complex models of physical systems to journalists who perform simple data aggregations and create visualizations. Our focus is on the simplest use cases. Making programmatic data exploration more spreadsheet-like should encourage users to choose programming tools over spreadsheets, resulting in more reproducible and transparent data analyses. Consider the Financial Times analysis of plastic waste [7,25]. It joins datasets from Eurostat, UN Comtrade and more, aggregates the data and builds a visualization comparing waste flows in 2017 and 2018. Figure 1 shows an excerpt from one notebook of the data analysis. The code has a number of important properties: There is no abstraction. The analysis uses lambda functions as arguments to library calls, but it does not define any custom functions. Code is parameterized by having a global variable material set to "plastics" and keeping other possible values in a comment. This lets the analyst run and check results of intermediate steps. The code relies on external lib...

“…Scribble-Refined Recent work [27] has extended Scribble specifically for F# to leverage the language's support for refinement types [17] and type providers [31]. The authors present an expressive refinement language to specify boolean refinements related to messages being sent.…”

Section: Related Work (mentioning)

confidence: 99%

Value-Dependent Session Design in a Dependently Typed Language

Muijnck-Hughes

Brady

Vanderbauwhede

2019

Electron. Proc. Theor. Comput. Sci.

Session Types offer a typing discipline that allows protocol specifications to be used during typechecking, ensuring that implementations adhere to a given specification. When looking to realise global session types in a dependently typed language care must be taken that values introduced in the description are used by roles that know about the value. We present Sessions, a Resource Dependent Embedded Domain Specific Language (EDSL) for describing global session descriptions in the dependently typed language Idris. As we construct session descriptions the values parameterising the EDSLs' type keeps track of roles and messages they have encountered. We can use this knowledge to ensure that message values are only used by those who know the value. Sessions supports protocol descriptions that are computable, composable, higher-order, and value-dependent. We demonstrate Sessions expressiveness by describing the TCP Handshake, a multi-modal server providing echo and basic arithmetic operations, and a Higher-Order protocol that supports an authentication interaction step.
