ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9053893

Detecting Multiple Speech Disfluencies Using a Deep Residual Network with Bidirectional Long Short-Term Memory

Abstract: Stuttering is a speech impediment that affects tens of millions of people in their everyday lives. Despite its prevalence, there is minimal data and research on the identification and classification of stuttered speech. This paper tackles the problem of detecting and classifying different forms of stutter. As opposed to most existing works, which identify stutters with language models, our work proposes a model that relies solely on acoustic features, allowing for identification of several variations of stutt…
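The architecture named in the title, a deep residual network feeding a bidirectional LSTM over acoustic features, can be sketched as follows. This is a minimal illustration assuming log-mel spectrogram input; the layer counts, channel widths, hidden sizes, and number of disfluency classes are placeholder assumptions, not values reported in the paper.

```python
# Minimal sketch of a residual-CNN + BiLSTM disfluency classifier in PyTorch.
# All layer counts, channel widths, hidden sizes, and the number of
# disfluency classes below are illustrative assumptions, not values
# reported in the paper.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection makes the block residual

class ResNetBiLSTM(nn.Module):
    def __init__(self, n_mels: int = 40, n_classes: int = 6):
        super().__init__()
        self.stem = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1), nn.ReLU())
        self.blocks = nn.Sequential(ResidualBlock(32), ResidualBlock(32))
        self.lstm = nn.LSTM(32 * n_mels, 128, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * 128, n_classes)

    def forward(self, x):  # x: (batch, 1, time, n_mels) spectrogram
        h = self.blocks(self.stem(x))         # (batch, 32, time, n_mels)
        h = h.permute(0, 2, 1, 3).flatten(2)  # (batch, time, 32 * n_mels)
        out, _ = self.lstm(h)                 # (batch, time, 2 * 128)
        return self.fc(out[:, -1])            # clip-level class logits

logits = ResNetBiLSTM()(torch.randn(2, 1, 100, 40))  # two dummy clips
```

In such a design the convolutional residual blocks capture local spectro-temporal patterns, while the bidirectional LSTM aggregates them over the whole clip before classification; this is the general division of labor the title implies.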

Cited by 58 publications (82 citation statements). References 22 publications.

“…In most cases the overall system maintains its modular setting and therefore requires not only separate optimization criteria for different components but also, in some cases, hand-labeled features for the annotation of the disfluencies to train the language model. A recent work brings the focus to the acoustic side and does not take into consideration any language-dependent information [12]. To the best of the authors' knowledge, with the exception of a work on personalized ASR for dysarthric speech [13], none of the published papers aims to improve the speech recognition accuracy of an E2E ASR system by dealing with disfluencies without solving the disfluency detection task itself.…”
Section: Prior Work (mentioning)
confidence: 99%
“…Our approach takes an audio clip, extracts acoustic features per-frame, applies a temporal model, and outputs a single set of clip-level dysfluency labels. We investigated baselines that are inspired by the dysfluency model in [10] and alternative input features, model architectures, and loss functions.…”
Section: Methods (mentioning)
confidence: 99%
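The pipeline described in this excerpt, per-frame acoustic features fed to a temporal model that emits one set of clip-level labels, can be illustrated with a short sketch. The log-mel features, layer sizes, five label types, 0.5 threshold, and input filename are assumptions for illustration, not details from the cited work.

```python
# Illustrative version of the pipeline described above: per-frame acoustic
# features -> temporal model -> one set of clip-level dysfluency labels.
# The log-mel features, layer sizes, five label types, 0.5 threshold, and
# "clip.wav" are assumptions for the sketch, not details from the cited work.
import librosa
import numpy as np
import torch

wav, sr = librosa.load("clip.wav", sr=16000)
mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=40)
frames = torch.from_numpy(np.log(mel + 1e-6).T).float()  # (time, 40)

lstm = torch.nn.LSTM(40, 64, batch_first=True, bidirectional=True)
head = torch.nn.Linear(2 * 64, 5)  # 5 hypothetical dysfluency types

out, _ = lstm(frames.unsqueeze(0))      # (1, time, 128) frame encodings
logits = head(out.mean(dim=1))          # pool frames into clip-level logits
labels = torch.sigmoid(logits) > 0.5    # independent per-type decisions
```

Mean-pooling the frame encodings followed by a sigmoid per label is one common way to turn frame-level features into the multi-label, clip-level output the excerpt describes.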
“…A major bottleneck in this area is that dysfluency datasets tend to be small and have few or inconsistent annotations not inherently designed for work on speech recognition tasks. Kourkounakis et al [10] used 800 speech clips (53 minutes) with custom annotations to detect dysfluencies from 25 children who stutter using the UCLASS dataset [11]. Riad et al [12] performed a similar task using 1429 utterances from 22 adults who stutter with the recent FluencyBank [13] dataset.…”
Section: Introduction (mentioning)
confidence: 99%