2022
DOI: 10.1007/978-3-030-94343-1_2

Creating Unbiased Public Benchmark Datasets with Data Leakage Prevention for Predictive Process Monitoring

Abstract: Advances in AI, and especially machine learning, are increasingly drawing research interest and efforts towards predictive process monitoring, the subfield of process mining (PM) that concerns predicting next events, process outcomes and remaining execution times. Unfortunately, researchers use a variety of datasets and ways to split them into training and test sets. The documentation of these preprocessing steps is not always complete. Consequently, research results are hard or even impossible to reproduce an…


Cited by 12 publications (6 citation statements)
References 8 publications
“…Additionally, test sets are often biased, and these challenges present obstacles to the advancement of the field [50]. The process of developing predictive machine-learning models in this study involved using the training dataset, which comprised 80% of the total dataset, and optimizing various hyperparameters and parameters using methods via grid search.…”
Section: Performance Evaluation Methods Of Machine-learning Models (mentioning)
confidence: 99%
“…It has been stated that in the field of predictive process monitoring, the separation between training and test sets is often incomplete, leading to a problem of data leakage. Additionally, test sets are often biased, and these challenges present obstacles to the advancement of the field [50]. The process of developing predictive machine-learning models in this study involved using the training dataset, which comprised 80% of the total dataset, and optimizing various hyperparameters and parameters using methods via grid search.…”
Section: Dataset and Methodologies Used In The Data Mining Process (mentioning)
confidence: 99%
“…We take the time attribute on the scale of seconds granularity and transform it to relative time expressing the duration of events (by the difference of two consecutive event timestamps). This also prevents a particular data leakage from the test set as indicated by [29]. The time attribute value corresponding to any special symbols for the activity label (e.g.…”
Section: Embedding (mentioning)
confidence: 99%
“…The performance of suffix prediction was evaluated only on average performance measures for a few real-life event logs [18]. In addition, different pre-processing and evaluation strategies have been used which makes it difficult to compare the results of the different approaches [18,29]. Our work contributes a unified framework for case suffix prediction with seven sequential DL models, multimodal event attribute fusion, and multi-target prediction (Section 3).…”
Section: Introduction (mentioning)
confidence: 99%
“…Previous work in PPM faces a few challenges: unclear training/test set splits and pre-processing schemes, which leads to noticeably different findings and challenges w.r.t reproducibility [33]. Since datasets often contain duplicates which are not respected by the evaluation schemes, results are often confounded.…”
Section: Introduction (mentioning)
confidence: 99%
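
The relative-time transformation described in the Embedding statement above (durations taken as the difference of two consecutive event timestamps, at seconds granularity) can be sketched as follows. This is a minimal illustration and not the citing authors' code: the function name `relative_times`, the choice of 0.0 for the first event of a case, and the sample timestamps are all assumptions made for the example.

```python
from datetime import datetime


def relative_times(timestamps):
    """Return per-event durations in seconds for the events of one case.

    Each duration is the difference between an event's timestamp and the
    timestamp of the event before it. The first event has no predecessor,
    so it is given 0.0 (an assumption made for this sketch).
    """
    durations = [0.0] if timestamps else []
    for previous, current in zip(timestamps, timestamps[1:]):
        durations.append((current - previous).total_seconds())
    return durations


# Hypothetical case with three events, ordered by time.
case = [
    datetime(2022, 1, 1, 9, 0, 0),
    datetime(2022, 1, 1, 9, 0, 30),
    datetime(2022, 1, 1, 9, 5, 0),
]
print(relative_times(case))  # [0.0, 30.0, 270.0]
```

Each value depends only on two adjacent events of the same case, so absolute timestamps do not appear in the resulting feature.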