2021
DOI: 10.48550/arxiv.2109.05257
Preprint
Towards a Rigorous Evaluation of Time-series Anomaly Detection

Abstract: In recent years, proposed studies on time-series anomaly detection (TAD) report high F1 scores on benchmark TAD datasets, giving the impression of clear improvements. However, most studies apply a peculiar evaluation protocol called point adjustment (PA) before scoring. In this paper, we theoretically and experimentally reveal that the PA protocol has a great possibility of overestimating the detection performance; that is, even a random anomaly score can easily turn into a state-of-the-art TAD method. Therefo…
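The abstract's central claim can be illustrated with a minimal sketch of the point-adjustment protocol: if any point inside a true anomaly segment is flagged, PA marks the entire segment as detected before F1 is computed. The synthetic labels, the 0.95 threshold, and both helper functions below are illustrative assumptions, not the paper's code:

```python
import numpy as np

rng = np.random.default_rng(0)

def point_adjust(pred, label):
    """Point adjustment: if any point inside a ground-truth anomaly
    segment is flagged, mark the whole segment as detected."""
    adjusted = pred.copy()
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:
                j += 1
            if adjusted[i:j].any():
                adjusted[i:j] = 1
            i = j
        else:
            i += 1
    return adjusted

def f1(pred, label):
    """Point-wise F1 score."""
    tp = int(((pred == 1) & (label == 1)).sum())
    fp = int(((pred == 1) & (label == 0)).sum())
    fn = int(((pred == 0) & (label == 1)).sum())
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

# Hypothetical labels: five 300-point anomaly segments in a 10,000-point series.
label = np.zeros(10000, dtype=int)
for start in range(500, 10000, 2000):
    label[start:start + 300] = 1

# A purely random anomaly score, thresholded at its 95th percentile.
pred = (rng.random(10000) > 0.95).astype(int)

print("F1 without PA:", round(f1(pred, label), 3))
print("F1 with PA:   ", round(f1(point_adjust(pred, label), label), 3))
```

Because each long segment almost surely contains at least one random detection, PA credits every point of that segment as a true positive, so the adjusted F1 is far higher than the raw one.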

Cited by 3 publications (3 citation statements)
References 17 publications
“…Intrinsic Anomaly Detection Performance Acknowledging recent concern on the validity of time series anomaly benchmarks (Kim et al, 2021), our main quantitative results report AUROC scores on the intrinsic anomaly detection task. For the ResThresh method, we compute the AUROC score based on the maximum residual value over the full residual series.…”
Section: Quantitative Results (confidence: 99%)
“…For each method, we perform a search for the best F1-score by considering all unique scores a method generated for the data set at hand as decision thresholds. We are well aware of the valid criticism of this evaluation method by Kim et al [30]; however, we propose here to demonstrate that our method performs reasonably rather than set a new state-of-the-art result. We think that for our purposes this metric is sufficient.…”
Section: Detection Performance (confidence: 88%)
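The best-F1 search described in this excerpt can be sketched as a sweep over every unique anomaly score used as a decision threshold; the toy score/label arrays below are hypothetical, and the sweep computes plain point-wise F1 with no point adjustment:

```python
import numpy as np

def best_f1(score, label):
    """Sweep every unique anomaly score as a decision threshold
    and return the best point-wise F1 (no point adjustment)."""
    best = 0.0
    for t in np.unique(score):
        pred = (score >= t).astype(int)
        tp = int(((pred == 1) & (label == 1)).sum())
        fp = int(((pred == 1) & (label == 0)).sum())
        fn = int(((pred == 0) & (label == 1)).sum())
        if tp == 0:
            continue
        prec, rec = tp / (tp + fp), tp / (tp + fn)
        best = max(best, 2 * prec * rec / (prec + rec))
    return best

# Toy example: the threshold 0.7 separates anomalies perfectly.
label = np.array([0, 0, 1, 1, 0, 0, 0, 1, 0, 0])
score = np.array([.1, .2, .9, .8, .3, .1, .2, .7, .2, .1])
print(best_f1(score, label))  # → 1.0
```

Note that reporting the best F1 over all thresholds assumes oracle threshold selection, which is part of what the criticism targets.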
“…We chose not to evaluate this setting as it has been shown that current benchmarks can be dominated by trivial methods. Particularly, Kim et al (2021) show that a simple approach of using the L2 norm of the raw features is sufficient for achieving competitive or better-than-state-of-the-art results on popular datasets such as WADI or SMAP. Furthermore, very strong results using the point-adjust protocol can be achieved using a random baseline.…”
Section: Discussion (confidence: 99%)
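The trivial L2-norm baseline mentioned above can be sketched as follows; the synthetic normal/anomalous data and the magnitudes are illustrative assumptions, not results from WADI or SMAP:

```python
import numpy as np

def l2_baseline_scores(x):
    """Trivial baseline: anomaly score = L2 norm of each raw
    multivariate observation. x has shape (time, features)."""
    return np.linalg.norm(x, axis=1)

rng = np.random.default_rng(1)
# Hypothetical series: 200 normal steps, then 20 steps with inflated magnitude.
normal = rng.normal(0.0, 1.0, size=(200, 8))
anomalous = rng.normal(0.0, 4.0, size=(20, 8))
x = np.vstack([normal, anomalous])

scores = l2_baseline_scores(x)
print(scores[200:].mean() > scores[:200].mean())  # anomalies score higher
```

No model is trained at all, which is why a baseline like this serves as a sanity check on whether a benchmark actually requires learning.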