Interspeech 2018
DOI: 10.21437/interspeech.2018-1240

On Learning to Identify Genders from Raw Speech Signal Using CNNs

Abstract: Automatic Gender Recognition (AGR) is the task of identifying the gender of a speaker given a speech signal. Standard approaches extract features like fundamental frequency and cepstral features from the speech signal and train a binary classifier. Inspired from recent works in the area of automatic speech recognition (ASR), speaker recognition and presentation attack detection, we present a novel approach where relevant features and classifier are jointly learned from the raw speech signal in end-to-end manne…

Cited by 38 publications (34 citation statements)
References 20 publications
“…In this paper, rather than extracting voice source related features from speech signals and then modelling them through a classifier, we develop methods to directly learn voice source related information in an end-to-end manner for depression detection. This is motivated from recent works that have shown that CNNs can learn task dependent information from raw speech signals in an end-to-end manner [20,21,22,23]. Specifically, we show that, by combining prior knowledge based signal processing and the capability of CNNs to learn task dependent information from raw signals, depression can be effectively detected from the voice source information.…”
Section: Introduction (mentioning)
confidence: 76%
“…Numerous attempts have been made to uncover links between speech parameters and speaker demographics [26,34,48,92]. A person's gender, for instance, can be reflected in voice onset time, articulation, and duration of vowels, which is due to various reasons, including differences in vocal fold anatomy, vocal tract dimensions, hormone levels, and sociophonetic factors [92].…”
Section: Inference Of Age and Gender (mentioning)
confidence: 99%
“…It has also been shown that male and female speakers differ measurably in word use [26]. Like humans, computer algorithms can identify the sex of a speaker from a voice sample with high accuracy [48]. Precise classification results are achieved even under adverse conditions, such as loud background noise or emotional and intoxicated speech [34].…”
Section: Inference Of Age and Gender (mentioning)
confidence: 99%
“…In [12], a spectral dictionary interpretation was proposed to understand the information modeled by the first convolution layer. This approach has been applied in other studies, such as [23] and [24], to understand the spectral information modeled by the CNNs. In this approach, the spectral response of the filters to the input speech is calculated in the following manner: 1) $s_t^c$ was taken as the input speech segment.…”
Section: Analysis Based On Spectral Dictionary Interpretation (mentioning)
confidence: 99%
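
The quoted statement above breaks off after its first step, so the remaining steps are not shown here. As a rough illustration only, the sketch below assumes one common formulation of a spectral response: each first-layer filter is correlated with the input segment, and the filters' frequency responses are summed with those correlations as weights. The function name `spectral_response`, the parameter `n_fft`, and the formulation itself are assumptions for illustration, not the procedure of the cited papers.

```python
import numpy as np


def spectral_response(segment, filters, n_fft=512):
    """Magnitude spectrum of a filter bank's response to one speech segment.

    Assumed formulation (not taken from the quoted text):
      y_k  = <segment, f_k>            correlation with filter k
      S    = | sum_k y_k * FFT(f_k) |  weighted sum of filter spectra
    """
    segment = np.asarray(segment, dtype=float)
    filters = np.asarray(filters, dtype=float)
    if filters.ndim != 2 or segment.shape != (filters.shape[1],):
        raise ValueError("segment length must equal the filter length")

    # One scalar per filter: how strongly the segment matches that filter.
    weights = filters @ segment

    # Frequency response of every filter (zero-padded to n_fft if longer).
    filter_spectra = np.fft.rfft(filters, n=n_fft, axis=1)

    # Weighted sum over filters, then magnitude per frequency bin.
    return np.abs(weights @ filter_spectra)
```

With a toy bank of cosine filters, the response to a cosine segment should peak at the frequency bin that the segment and the matching filter share.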