2017
DOI: 10.1007/s10618-017-0520-3
Identifying consistent statements about numerical data with dispersion-corrected subgroup discovery

Abstract: Existing algorithms for subgroup discovery with numerical targets do not optimize the error or target variable dispersion of the groups they find. This often leads to unreliable or inconsistent statements about the data, rendering practical applications, especially in scientific domains, futile. Therefore, we here extend the optimistic estimator framework for optimal subgroup discovery to a new class of objective functions: we show how tight estimators can be computed efficiently for all functions that are det…
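To illustrate the kind of objective the abstract refers to, here is a minimal sketch of a dispersion-corrected quality function for subgroups with a numeric target. The specific form used below (coverage-weighted median gain, penalized by the average absolute deviation from the subgroup median) is an illustrative assumption, not necessarily the exact objective defined in the paper:

```python
import statistics

def dispersion_corrected_quality(subgroup, population):
    """Illustrative dispersion-corrected subgroup quality (an assumption,
    not the paper's exact definition).

    Rewards subgroups whose target median exceeds the population median,
    weighted by coverage, and penalizes within-subgroup dispersion
    measured as the average absolute deviation from the subgroup median.
    """
    if not subgroup:
        return 0.0
    coverage = len(subgroup) / len(population)
    med_s = statistics.median(subgroup)
    med_p = statistics.median(population)
    # robust dispersion: average absolute deviation from the median
    aad = sum(abs(x - med_s) for x in subgroup) / len(subgroup)
    return coverage * ((med_s - med_p) - aad)

population = [1.0, 1.2, 0.9, 5.0, 5.2, 4.9, 1.1, 5.1]
tight = [5.0, 5.2, 4.9, 5.1]   # high median, low dispersion
loose = [5.2, 0.9, 5.0, 1.1]   # same size, mixed values, high dispersion
print(dispersion_corrected_quality(tight, population) >
      dispersion_corrected_quality(loose, population))  # True
```

The point of the correction is visible in the example: both candidate subgroups cover half the data, but only the low-dispersion one supports a consistent statement about the target, so it scores strictly higher.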

Cited by 34 publications
(37 citation statements)
References 23 publications
“…One could instead partition the dataset into chemically similar catalyst subgroups via clustering algorithms and train a separate model on each subgroup, which can increase prediction accuracy by reducing the different physicochemical effects that each ML model must describe. As an alternative, local pattern search algorithms such as subgroup discovery (SGD) could be used to automatically find and describe subgroups …”
Section: Impact Of Machine Learning On Heterogeneous Catalysis (mentioning, confidence: 99%)
“…Essentially two areas exist where the presence of numeric attributes requires attention: on the side of the target attribute(s) (in the case of a regression setting), and on the side of the description attributes (those attributes that are not targets, and are available to construct subgroups from). On the target side, several recent papers discuss the treatment of numeric target attributes (Atzmüller and Lemmerich 2009; Boley et al. 2017; Lemmerich et al. 2012, 2013, 2016), but all these papers describe methods that essentially assume nominal description attributes.…”
Section: Introduction (mentioning, confidence: 99%)
“…This allows one to systematically reason about the described sub-domains (e.g., it is easy to determine their differences and overlap) and also to sample novel points from them. To specifically obtain regions where a given model has a decreased error, SGD algorithms [26] can be configured to yield a selector with maximum impact on the model error. The impact is defined as the product of the selector coverage, i.e., the probability that the selector evaluates to true, and the selector effect on the model error, i.e., the model error minus the model error given that the features satisfy the selector.…”
(mentioning, confidence: 99%)
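The impact measure quoted above is directly computable. The sketch below follows the excerpt's definition (coverage times the drop in mean model error when conditioning on the selector); the function and variable names are illustrative, and mean absolute error is assumed as the error measure:

```python
def selector_impact(errors, selector_mask):
    """Impact of a selector on model error, per the quoted definition:
    coverage of the selector times (overall mean error minus the mean
    error on samples where the selector holds).

    errors        -- per-sample model errors (e.g. absolute residuals)
    selector_mask -- per-sample booleans: does the selector hold?
    """
    n = len(errors)
    covered = [e for e, m in zip(errors, selector_mask) if m]
    if not covered:
        return 0.0
    coverage = len(covered) / n                    # P(selector = true)
    global_error = sum(errors) / n                 # overall mean error
    conditional_error = sum(covered) / len(covered)
    return coverage * (global_error - conditional_error)

errors = [0.1, 0.2, 0.9, 1.0, 0.15, 0.85]
mask   = [True, True, False, False, True, False]
print(round(selector_impact(errors, mask), 4))  # → 0.1917
```

A positive impact means the selector describes a region where the model is more reliable than on the dataset as a whole, which is exactly the kind of region the excerpt aims to identify.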
“…On top of that, we compare the identified DA selectors across the six individual experiments to assess their stability. SGD is performed with non-redundant branch-and-bound search with tight optimistic estimators and pre-discretization of cut-off values by 5-means clustering as described in Ref. [26].…”
(mentioning, confidence: 99%)
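The pre-discretization step mentioned in this excerpt can be sketched as a one-dimensional k-means with k = 5 over a numeric attribute, with candidate cut-offs taken between adjacent cluster centers. The quantile-based initialization and the use of midpoints between centers are assumptions for illustration; Ref. [26] may use different details:

```python
def kmeans_1d(values, k=5, iters=50):
    """Minimal 1-D k-means (Lloyd's algorithm); returns sorted centers.
    Initialization at evenly spaced quantiles is an assumption."""
    vs = sorted(values)
    centers = [vs[int(i * (len(vs) - 1) / (k - 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vs:  # assign each value to its nearest center
            j = min(range(k), key=lambda j: abs(v - centers[j]))
            clusters[j].append(v)
        # recompute centers (keep the old one if a cluster went empty)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return sorted(centers)

def cutoffs_from_centers(centers):
    """Candidate cut-off values: midpoints between adjacent centers."""
    return [(a + b) / 2 for a, b in zip(centers, centers[1:])]

values = [0.9, 1.0, 1.1, 3.0, 3.1, 5.0, 5.05, 7.0, 9.0, 9.2]
print(cutoffs_from_centers(kmeans_1d(values, k=5)))
```

Restricting the search to these few data-driven cut-offs keeps the branch-and-bound enumeration of numeric conditions tractable without hand-picking thresholds.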