2020
DOI: 10.48550/arxiv.2012.09258
Preprint

Detection of data drift and outliers affecting machine learning model performance over time

Abstract: A trained ML model is deployed on another 'test' dataset where target feature values (labels) are unknown. Drift is distribution change between the training and deployment data, which is concerning if model performance changes. For a cat/dog image classifier, for instance, drift during deployment could be rabbit images (new class) or cat/dog images with changed characteristics (change in distribution). We wish to detect these changes but can't measure accuracy without deployment data labels. We instead detect …
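The abstract describes detecting drift without access to deployment labels. The paper's own detection method is not reproduced here; as a generic, hedged illustration of label-free drift detection, the sketch below compares per-feature distributions of the training and deployment data with two-sample Kolmogorov-Smirnov tests. The function name, threshold, and toy data are illustrative assumptions, not the paper's algorithm.

```python
# Minimal sketch of label-free drift detection (generic illustration, not the
# paper's specific method): compare per-feature distributions of the training
# data and the unlabeled deployment data with two-sample KS tests.
import numpy as np
from scipy.stats import ks_2samp

def detect_drift(X_train, X_deploy, alpha=0.01):
    """Return (feature index, statistic, p-value) for features whose deployment
    distribution differs significantly from the training distribution."""
    drifted = []
    for j in range(X_train.shape[1]):
        res = ks_2samp(X_train[:, j], X_deploy[:, j])
        if res.pvalue < alpha:  # reject "same distribution" for feature j
            drifted.append((j, res.statistic, res.pvalue))
    return drifted

# Toy usage: feature 1 shifts at deployment time, feature 0 does not.
rng = np.random.default_rng(0)
X_train = rng.normal(0.0, 1.0, size=(1000, 2))
X_deploy = np.column_stack([
    rng.normal(0.0, 1.0, 1000),   # unchanged feature
    rng.normal(1.5, 1.0, 1000),   # shifted feature (drift)
])
print(detect_drift(X_train, X_deploy))
```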

Cited by 12 publications (14 citation statements)
References 5 publications
“…This method aims to maximise the data quality by reducing various data issues, such as outliers, correlated features, skewed data, imbalanced categorical data, etc. [1, 39, 50]. This method also involves the application of automated correction algorithms to correct these issues; for example, the SMOTE technique [18] can be used to mitigate the class imbalance problem, or redundant data can be removed using automated algorithms.…”
Section: Data Configuration Mechanisms (mentioning)
confidence: 99%
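The quoted passage names SMOTE as one automated correction for class imbalance. As a hedged, minimal sketch (the citing paper does not specify an implementation), SMOTE from the imbalanced-learn package could be applied as follows; the synthetic dataset and parameters are illustrative assumptions.

```python
# Minimal sketch of one automated correction mentioned above: oversampling the
# minority class with SMOTE (imbalanced-learn). Illustrative only.
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

# Synthetic, imbalanced two-class data (roughly 10% minority class).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("before:", Counter(y))

# SMOTE synthesises new minority-class samples by interpolating between
# existing minority samples and their nearest neighbours.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("after: ", Counter(y_res))
```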
“…The manual configuration approach enables domain experts, such as healthcare experts, to utilise their prior knowledge to assess the importance of predictor variables and mitigate bias or anomalies within the training data. In contrast, the automated configuration highlights potential issues in the training data [1,39,50] and allows users to select the issues that need correction. The system automatically applies correction algorithms to minimise these potential issues and retrains the prediction model on the configured data.…”
Section: Introduction (mentioning)
confidence: 99%
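The quoted workflow (flag issues in the training data, let the user choose corrections, apply them, retrain) can be illustrated with a short sketch. This is an assumed, generic rendering rather than the cited system's actual pipeline; IsolationForest-based outlier flagging and logistic regression are stand-ins chosen for illustration.

```python
# Hedged sketch of the automated-configuration loop described above: flag a
# data issue (outliers, detected with IsolationForest as an assumed example),
# drop the flagged rows if the user opts in, then retrain the model.
from sklearn.datasets import make_classification
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)

# Step 1: highlight potential issues in the training data.
outlier_mask = IsolationForest(contamination=0.05,
                               random_state=0).fit_predict(X) == -1
print(f"flagged {outlier_mask.sum()} potential outliers")

# Step 2: user selects which issues to correct (hard-coded here for brevity).
apply_outlier_removal = True

# Step 3: apply the correction and retrain the prediction model.
if apply_outlier_removal:
    X, y = X[~outlier_mask], y[~outlier_mask]
model = LogisticRegression(max_iter=1000).fit(X, y)
print("retrained on", len(y), "rows")
```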
“…For example, radiologists working in the NHS breast screening programme are subject to a range of monitoring and auditing procedures (Cohen et al 2018). Data and concept 'drift' mean that an AI system's performance may also change over time (Davis et al 2017a, 2017b, Health 2022), raising the need for monitoring and auditing procedures and tools to detect changes that might put patient safety at risk (Ackerman et al 2020, Henne et al 2020, Nix et al 2022). Providing support for the monitoring and auditing of AI systems (Davis et al 2019, Liu et al 2022) would therefore be another scenario to be taken into consideration in the design of the system biography described above.…”
Section: Understanding Accountability As a Constraint And A Resource … (mentioning)
confidence: 99%
“…Surrogate models (i.e. simplified proxies of a model; also called emulators) must be treated with care because some may miss important fringe cases or rare events that more fine-grained models are able to better predict, such as in the case of machine learning algorithms deployed with insufficient training data [1, 2, 3, 5, 87].…”
Section: Building Multiscale Models (mentioning)
confidence: 99%
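To make the quoted caveat concrete, a toy sketch (not drawn from the cited work) fits a cheap linear surrogate to a more fine-grained model and shows that the two agree on typical inputs but diverge in a rare input regime; the data-generating process and model choices are illustrative assumptions.

```python
# Hedged illustration: a linear surrogate fitted to a fine-grained model's
# predictions tracks it on typical inputs yet misses a rare-regime effect.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 1))
# Mostly linear response, plus a large effect that only occurs in a rare regime (x > 2).
y = 2 * X[:, 0] + 5.0 * (X[:, 0] > 2) + 0.1 * rng.normal(size=5000)

fine = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
surrogate = LinearRegression().fit(X, fine.predict(X))  # cheap emulator of the fine model

# The surrogate tracks the fine model on typical inputs but misses the rare-regime jump.
typical = rng.uniform(-1.5, 1.5, size=(1000, 1))
rare = rng.uniform(2.2, 3.0, size=(1000, 1))
for name, Z in [("typical", typical), ("rare", rare)]:
    gap = np.mean(np.abs(surrogate.predict(Z) - fine.predict(Z)))
    print(f"{name:7s} mean |surrogate - fine| = {gap:.2f}")
```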