Interspeech 2017
DOI: 10.21437/interspeech.2017-360

Audio Replay Attack Detection with Deep Learning Frameworks

Abstract: Nowadays spoofing detection is one of the priority research areas in the field of automatic speaker verification. The success of Automatic Speaker Verification Spoofing and Countermeasures (ASVspoof) Challenge 2015 confirmed the impressive perspective in detection of unforeseen spoofing trials based on speech synthesis and voice conversion techniques. However, there is a small number of researches addressed to replay spoofing attacks which are more likely to be used by non-professional impersonators. This pape…

Citations: cited by 236 publications (189 citation statements)
References: 13 publications
“…We operate on power spectrograms instead of other alternative time-frequency representations, following findings in [25]. All our sub-CNNs use the architecture described in [26], which is an adapted version of the best performing model [25] in the ASVspoof 2017 challenge. It consists of 9 convolutional layers, 5 max-pooling layers and 2 fully connected (FC) layers.…”
Section: Proposed Methodology (mentioning)
confidence: 99%
“…The input to the network is a mean-variance normalized log power spectrogram of 3 seconds. This normalisation, motivated from [25], is performed at the utterance-level to standardize the features (zero mean and unit variance for all frequency bins) within a given recording. We use a 512-point fast Fourier transform (FFT), and a 32 ms window with a hop of 10 ms.…”
Section: Input Representation and Preprocessing (mentioning)
confidence: 99%
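The preprocessing described in this statement can be sketched roughly as follows. This is a minimal illustration, not the citing authors' code: the 16 kHz sampling rate (which makes a 32 ms window exactly 512 samples), the Hann window, the natural logarithm, and the function names `log_power_spectrogram` and `utterance_mvn` are all assumptions that the statement does not specify.

```python
import numpy as np

def log_power_spectrogram(signal, sample_rate=16000, n_fft=512,
                          win_ms=32.0, hop_ms=10.0, eps=1e-10):
    """Return a log power spectrogram of shape (n_fft // 2 + 1, n_frames).

    Assumed, not stated in the source: 16 kHz audio, Hann window, natural log.
    """
    signal = np.asarray(signal, dtype=np.float64)
    win_len = int(round(sample_rate * win_ms / 1000.0))  # 512 samples at 16 kHz
    hop_len = int(round(sample_rate * hop_ms / 1000.0))  # 160 samples at 16 kHz
    if len(signal) < win_len:
        # Zero-pad very short inputs so that at least one frame exists.
        signal = np.pad(signal, (0, win_len - len(signal)))
    n_frames = 1 + (len(signal) - win_len) // hop_len
    window = np.hanning(win_len)
    frames = np.stack([
        signal[i * hop_len: i * hop_len + win_len] * window
        for i in range(n_frames)
    ])
    spectrum = np.fft.rfft(frames, n=n_fft, axis=1)
    power = np.abs(spectrum) ** 2
    return np.log(power + eps).T  # rows are frequency bins, columns are frames

def utterance_mvn(spec, eps=1e-8):
    """Utterance-level normalisation: zero mean, unit variance per frequency bin."""
    mean = spec.mean(axis=1, keepdims=True)
    std = spec.std(axis=1, keepdims=True)
    return (spec - mean) / (std + eps)

# Usage: 3 seconds of synthetic noise standing in for a real recording.
rng = np.random.default_rng(0)
audio = rng.standard_normal(3 * 16000)
features = utterance_mvn(log_power_spectrogram(audio))
print(features.shape)  # (257, 297)
```

The statistics are computed along the time axis of a single recording, so every frequency bin is standardised independently of other utterances, which matches the "utterance-level" wording of the statement.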
“…The main motivation to use the T frames at the beginning is to build fixed-length utterance-level countermeasure models, which is a commonly adopted design choice for anti-spoofing systems, e.g. [15,14].…”
Section: Features and Input Representation (mentioning)
confidence: 99%
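Keeping only the first T frames, as this statement describes, might look like the sketch below. The statement does not say how utterances shorter than T frames are handled, so the repetition padding here is an assumption, as is the hypothetical helper name `first_t_frames`.

```python
import numpy as np

def first_t_frames(spec, t_frames):
    """Return the first t_frames columns of a (n_bins, n_frames) feature matrix.

    Assumption: utterances shorter than t_frames are extended by repeating
    their own frames; the source does not specify the padding strategy.
    """
    spec = np.asarray(spec)
    n_frames = spec.shape[1]
    if n_frames == 0:
        raise ValueError("spectrogram has no frames")
    if n_frames >= t_frames:
        return spec[:, :t_frames]
    reps = -(-t_frames // n_frames)  # ceiling division
    return np.tile(spec, (1, reps))[:, :t_frames]

# Usage: a toy 3-bin, 4-frame matrix cut to 2 frames and extended to 10.
toy = np.arange(12).reshape(3, 4)
print(first_t_frames(toy, 2).shape)   # (3, 2)
print(first_t_frames(toy, 10).shape)  # (3, 10)
```

Every utterance then yields an input of identical shape, which is what lets a single fixed-size utterance-level model process all recordings.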
“…When an automatic speaker verification (ASV) system is presented with fake speech intending to impersonate a live talker then this is commonly referred to as either "spoofing", a replay attack or a presentation attack. Building upon similar challenges from earlier years [1,2,3], detecting replay attacks was the basis for the Physical Access (PA) task of the ASVspoof 2019 Challenge [4]. The aim of our work was to detect whether or not a bona fide live speech recording had been intercepted and subsequently replayed back as non-live speech to an ASV system.…”
Section: Introduction (mentioning)
confidence: 99%
“…There are numerous variables to be modeled including elements of how the attack was conducted as well as the type of ASV system being attacked. Our approach captured these conditions using a convolutional neural network (CNN) architecture, similar to the top system from the earlier ASVspoof 2017 replay detection challenge [3]. We explored several types of signal features such as those from [2] and our final submission was based on our own specially-trained x-vector [5] embeddings combined with signal features.…”
Section: Introduction (mentioning)
confidence: 99%