Dataset Artefacts in Anti-Spoofing Systems: A Case Study on the ASVspoof 2017 Benchmark

Chettri, Bhusan; Benetos, Emmanouil; Sturm, Bob L.

doi:10.1109/taslp.2020.3036777

Cited by 24 publications

(21 citation statements)

References 54 publications

Supporting

Mentioning

Contrasting

Order By: Relevance

“…Using a fixed-size input has side-effects: trimming discards information in x K+1:N (j) , and padding propagates artifacts [18]. While chunking keeps all the information, the chunks are independently scored by the neural network.…”

Section: From Varied-length Input To Scorementioning

confidence: 99%

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

Wang¹,

Yamagishi²

2021

Interspeech 2021

View full text Add to dashboard Cite

A great deal of recent research effort on speech spoofing countermeasures has been invested into back-end neural networks and training criteria. We contribute to this effort with a comparative perspective in this study. Our comparison of countermeasure models on the ASVspoof 2019 logical access scenario takes into account common strategies to deal with input trials of varied length, recently proposed marginbased training criteria, and widely used front ends. We also measured intra-model differences through multiple trainingevaluation rounds with random initialization. Our statistical analysis demonstrates that the performance of the same model may be statistically significantly different when just changing the random initial seed. We thus recommend similar statistical analysis or reporting results of multiple runs for further research on the database. Despite the intra-model differences, we observed a few promising techniques, including average pooling, to efficiently process varied-length inputs and a new hyper-parameter-free loss function. The two techniques led to the best single model in our experiment, which achieved an equal error rate of 1.92% and was significantly different in statistical sense from most of the other experimental models.

show abstract

Section: From Varied-length Input To Scorementioning

confidence: 99%

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

Wang¹,

Yamagishi²

2021

Interspeech 2021

View full text Add to dashboard Cite

show abstract

“…The above trim-and-pad strategy either discards useful information or adds redundancy to a 1:N (j) [16]. When N (j) < N , many models simply take the first N frames by setting a 1:N = a 1:N [55].…”

Section: Fixed-input-size Backmentioning

confidence: 99%

“…In both cases, only part of the information in a 1:N (j) is consumed by the back end. When N (j) > N , zero-padding or duplication can be used, but it causes artifacts [16].…”

Section: Fixed-input-size Backmentioning

confidence: 99%

A Practical Guide to Logical Access Voice Presentation Attack Detection

Wang¹,

Yamagishi²

2022

Preprint

View full text Add to dashboard Cite

Voice-based human-machine interfaces with an automatic speaker verification (ASV) component are commonly used in the market. However, the threat from presentation attacks is also growing since attackers can use recent speech synthesis technology to produce a naturalsounding voice of a victim. Presentation attack detection (PAD) for ASV, or speech anti-spoofing, is therefore indispensable. Research on voice PAD has seen significant progress since the early 2010s, including the advancement in PAD models, benchmark datasets, and evaluation campaigns. This chapter presents a practical guide to the field of voice PAD, with a focus on logical access attacks using text-to-speech and voice conversion algorithms and spoofing countermeasures based on artifact detection. It introduces the basic concept of voice PAD, explains the common techniques, and provides an experimental study using recent methods on a benchmark dataset. Code for the experiments is open-sourced.

show abstract

“…For example, [8] find that the length of silence in the PA part of ASVspoof 2019 differs between bonafide and spoof data. Additionally, for the ASVspoof 2017 replay detection challenge, [14] observe that the presence of certain attributes such as 'clicks', trailing and leading silence in the audio may hint at the target label. However, a thorough examination of the impact of the silence present in ASVspoof 2019 has, to the best of our knowledge, not been presented before.…”

Section: Related Workmentioning

confidence: 99%

Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?

Müller¹,

Dieckmann²,

Pavel³

et al. 2021

2021 Edition of the Automatic Speaker Verification and Spoofing Countermeasures Challenge

View full text Add to dashboard Cite

The ASVspoof Dataset is one of the most established datasets for training and benchmarking systems designed for the detection of spoofed audio and audio deepfakes. However, we observe an uneven distribution of silence length in dataset's training and test data, which hints at the target label: Bona-fide instances tend to have significantly longer leading + trailing silences than spoofed instances. This could be problematic, since a model may learn to only, or at least partially, base its decision on the length of the silence (similar to the issue with the Pascal VOC 2007 dataset, where all images of horses also contained a specific watermark [1]). In this paper, we explore this phenomenon in depth. We train a number of networks on only a) the length of the leading silence and b) with and without leading + trailing silence. Results show that models trained on only the length of the leading silence perform suspiciously well: They achieve up to 85% percent accuracy and an equal error rate (EER) of 0.15 on the 'eval' split of the data. Conversely, when training strong models on the full audio files, we observe that trimming silence during preprocessing dramatically worsens performance (EER increases from 0.03 to 0.15). This could indicate that previous work may, in part, have learned only to classify targets based on the length of silence. Consequently, it could mean that spoofing detection may not be as advanced as previous high-scores have led to believe. We hope that by sharing these results, the ASV community can further evaluate this phenomenon.

show abstract

Dataset Artefacts in Anti-Spoofing Systems: A Case Study on the ASVspoof 2017 Benchmark

Cited by 24 publications

References 54 publications

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

A Comparative Study on Recent Neural Spoofing Countermeasures for Synthetic Speech Detection

A Practical Guide to Logical Access Voice Presentation Attack Detection

Speech is Silver, Silence is Golden: What do ASVspoof-trained Models Really Learn?

Contact Info

Product

Resources

About