ptype: probabilistic type inference

Ceritli, Taha; Williams, Christopher K. I.; Geddes, James

doi:10.1007/s10618-020-00680-1

Cited by 8 publications (20 citation statements)

References 27 publications


“…Background: In this work, we extend the probabilistic type inference method called ptype [3]. Assuming that the data entries are read as strings, ptype allows us to infer a plausible column type (Boolean, date, float, integer or string) for a data column, and, conditioned on that type, identify any values which are deemed missing or anomalous.…”

Section: Methods (mentioning)

confidence: 99%
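The task described in the statement above (read entries as strings, pick a plausible column type, then flag missing and anomalous values given that type) can be illustrated with a minimal sketch. This is not ptype's probabilistic model; it swaps in plain regular-expression matching and a match-rate cutoff. The patterns, the `MISSING` token set, the `infer_column` name and the `min_match` parameter are all assumptions made for illustration.

```python
import re

# Hypothetical patterns and missing-data tokens, chosen for illustration only.
PATTERNS = {
    "boolean": re.compile(r"true|false|yes|no", re.IGNORECASE),
    "integer": re.compile(r"[+-]?\d+"),
    "float": re.compile(r"[+-]?(\d+\.\d*|\.\d+|\d+)([eE][+-]?\d+)?"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
}
MISSING = {"", "NA", "N/A", "NULL", "-", "?"}


def infer_column(values, min_match=0.8):
    """Return (type, missing values, anomalous values) for a column of strings."""
    cleaned = [v.strip() for v in values]
    missing = [v for v in cleaned if v in MISSING]
    present = [v for v in cleaned if v not in MISSING]

    # Score each candidate type by the fraction of present values it matches.
    # Ties keep the earlier (more specific) type, so integers stay integers.
    best_type, best_score = "string", 0.0
    for name, pattern in PATTERNS.items():
        matched = sum(bool(pattern.fullmatch(v)) for v in present)
        score = matched / max(len(present), 1)
        if score > best_score:
            best_type, best_score = name, score

    # Fall back to string when no type explains enough of the column.
    if best_score < min_match:
        return "string", missing, []

    # Conditioned on the chosen type, non-matching values count as anomalies.
    anomalies = [v for v in present if not PATTERNS[best_type].fullmatch(v)]
    return best_type, missing, anomalies


column = ["1", "2", "NA", "x", "3", "4", "5"]
print(infer_column(column))
```

A hard cutoff like `min_match` is the kind of fixed rule that a probabilistic approach replaces with a posterior over types, so the sketch only conveys the shape of the input and output.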

“…For example, in a data table about clothing, a variable "Class Name" could be a categorical variable taking on values such as Jackets, Dresses and Pants, while a variable "Rating" may take on values in a fixed range 1 through 5. To the best of our knowledge, these issues are not addressed by any existing work in the literature, except Bot (proposed by Majoor and Vanschoren [1]), OpenML and Weka which tackle the type inference based on heuristics such as labeling a column as categorical when the number of unique values is lower than a threshold (see Sec. 3 for a detailed discussion).…”

Section: Introduction (mentioning)

confidence: 99%
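The unique-value heuristic mentioned in the statement above can be sketched in a few lines. The function name `looks_categorical` and the default threshold of 10 are assumptions for illustration; they are not taken from Bot, OpenML or Weka.

```python
def looks_categorical(values, threshold=10):
    """Label a column categorical when it has fewer unique values than threshold.

    A hypothetical stand-in for the threshold heuristics described above.
    """
    unique = {v.strip() for v in values if v.strip()}
    return 0 < len(unique) < threshold


class_name = ["Jackets", "Dresses", "Pants", "Dresses", "Jackets"]
rating = ["1", "5", "3", "4", "2", "5"]
row_id = [str(i) for i in range(100)]

print(looks_categorical(class_name))  # few distinct strings
print(looks_categorical(rating))      # integers in a fixed range 1 through 5
print(looks_categorical(row_id))      # every value distinct
```

The rule depends only on the count of distinct values, so a short table with a handful of distinct identifiers would also be labeled categorical.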

“…- We define inference of categorical values as the task of identifying the possible values a categorical variable can take on. We address this task by adapting ptype [3] which can robustly determine the possible values of a categorical variable by identifying missing data and anomalies in a data column (Section 2). - We show that our methods outperform the existing methods using a large number of datasets (Section 4).…”

Section: Introduction (mentioning)

confidence: 99%


ptype-cat: Inferring the Type and Values of Categorical Variables

Ceritli, Williams (2021). Preprint. Self-citation.

Type inference is the task of identifying the type of values in a data column and has been studied extensively in the literature. Most existing type inference methods support data types such as Boolean, date, float, integer and string. However, these methods do not consider non-Boolean categorical variables, where there are more than two possible values encoded by integers or strings. Therefore, such columns are annotated either as integer or string rather than categorical, and need to be transformed into categorical manually by the user. In this paper, we propose a probabilistic type inference method that can identify the general categorical data type (including non-Boolean variables). Additionally, we identify the possible values of each categorical variable by adapting the existing type inference method ptype. Combining these methods, we present ptype-cat which achieves better results than existing applicable solutions.



“…Automatically discovering the statistical and semantic types of data in tables is a valuable tool in data preparation and information retrieval. Accordingly, methods have been presented that predict the type of a column [3,4]. These methods expect the values in a column to have the same type.…”

Section: Single Column Type Detection (mentioning)

confidence: 99%

“…Table segmentation is related to statistical and semantic type detection, where the goal is to find the data type of a set of values. Unlike our unsupervised segmentation approach, type detection generally works in a predictive setting, where the goal is to classify the statistical type of columns or to annotate them with semantic types [3,4]. As data is assumed to be grouped in sets of values that share a distinctive type, table segmentation can serve as a preprocessing step.…”

Section: Related Work (mentioning)

confidence: 99%

Muppets: Multipurpose Table Segmentation

Verbruggen, Contreras-Ochando, Ferri, et al. (2021). Advances in Intelligent Data Analysis XIX.

We present muppets, a framework for partitioning cells in a table in segments that fulfil the same semantic role or belong to the same semantic data type, similar to how image segmentation is used to group pixels that represent the same semantic object in computer vision. Flexible constraints can be imposed on these segmentations for different use cases. muppets uses a hierarchical merge tree algorithm, which allows for efficiently finding segmentations that satisfy given constraints and only requires similarities between neighbouring cells to be computed. Three applications are used to illustrate and evaluate muppets: identifying tables and headers, type detection and discovering semantic errors.


Automating Data Science

Brazdil, Rijn, Soares, et al. (2022). Cognitive Technologies.

It has been observed that, in data science, a great part of the effort usually goes into various preparatory steps that precede model-building. The aim of this chapter is to focus on some of these steps. A comprehensive description of a given task to be resolved is usually supplied by the domain expert. Techniques exist that can process natural language description to obtain task descriptors (e.g., keywords), determine the task type, the domain, and the goals. This in turn can be used to search for the required domain-specific knowledge appropriate for the given task. In some situations, the data required may not be available and a plan needs to be elaborated regarding how to get it. Although not much research has been done in this area so far, we expect that progress will be made in the future. In contrast to this, the area of preprocessing and transformation has been explored by various researchers. Methods exist for selection of instances and/or elimination of outliers, discretization and other kinds of transformations. This area is sometimes referred to as data wrangling. These transformations can be learned by exploiting existing machine learning techniques (e.g., learning by demonstration). The final part of this chapter discusses decisions regarding the appropriate level of detail (granularity) to be used in a given task. Although it is foreseeable that further progress could be made in this area, more work is needed to determine how to do this effectively.

