New Era for Robust Speech Recognition 2017
DOI: 10.1007/978-3-319-64680-0_14

The CHiME Challenges: Robust Speech Recognition in Everyday Environments

Abstract: The CHiME challenge series has been aiming to advance the development of robust automatic speech recognition for use in everyday environments by encouraging research at the interface of signal processing and statistical modelling. The series has been running since 2011 and is now entering its 4th iteration. This chapter provides an overview of the CHiME series including a description of the datasets that have been collected and the tasks that have been defined for each edition. In particular the chapter descri…

Cited by 19 publications (7 citation statements)
References 24 publications
“…This model (referred to as "d-vector V2" in [13]) has a 3.06% equal error rate (EER) on our internal en-US phone audio test dataset, compared to the 3.55% EER of the one reported in [10]. VoiceFilter: We cannot use a "standard" benchmark corpus for speech separation, such as one of the CHiME challenges [19], because we need a clean reference utterance of each target speaker in order to compute speaker embeddings. Instead, we train and evaluate the VoiceFilter system on our own generated data, derived either from the VCTK dataset [20] or from the LibriSpeech dataset [16].…”
Section: Datasets (mentioning)
confidence: 99%
“…However, when operating on ASR transcripts (including recognition errors), the speech-based models were competitive in performance with the text-based models. In particular, prior work has found that WER of ≈ 30% is typical for modern ASR in many real-world settings or without good-quality microphones (Lasecki et al, 2012;Barker et al, 2017). When operating on such ASR output, the RMS error of the speech-based model and the text-based model were comparable.…”
Section: Model (mentioning)
confidence: 91%
“…Therefore, several speech corpora were recorded. For instance, the CHiME corpora [11] are made of English speech recordings in different noise conditions. In particular, the CHiME-5 data set is composed of recordings of 4-person dinner parties (host couple and guests).…”
Section: State of the Art: Available Corpora (mentioning)
confidence: 99%