2019
DOI: 10.1007/978-3-030-37734-2_60
An Attention Based Speaker-Independent Audio-Visual Deep Learning Model for Speech Enhancement

Cited by 4 publications (13 citation statements) · References 13 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine …”

[The remainder of this excerpt is flattened table residue listing audio-visual corpora, among them OuluVS [286], VoxCeleb2 [38] (6,112 speakers, continuous sentences, videos in the wild) and LRS2 [8] (hundreds of speakers, continuous sentences, videos in the wild).]
Section: Audio-Visual Corpora
confidence: 99%
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic …”

[The remainder of this excerpt is flattened table residue listing visual feature types and their references: landmark-based features [100], [154], [183], [203]; multisensory features [195]; face recognition embeddings [55], [109], [169], [192], [239]; VSR embeddings [7], [10], [107]–[109], [153], [222], [273]; facial appearance embeddings [42], [208]; compressed mouth frames [37]; speaker direction [85], [244], [279].]
Section: Audio-Visual Speech Enhancement and Separation Systems
confidence: 99%
“…[7], [10], [153] …”

[This excerpt is flattened table residue listing acoustic representations and their references: complex spectrogram [55], [107], [109], [169], [239]; raw waveform [108], [273]; speaker embeddings [10], [85], [169], [192], [208].]
Section: Audio-Visual Speech Enhancement and Separation Systems
confidence: 99%
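
The last excerpt contrasts the two acoustic representations most often used by the cited enhancement systems: the complex spectrogram and the raw waveform. As an illustration only (not the code of any surveyed system), a minimal NumPy sketch of the short-time Fourier transform that maps a raw waveform to a complex spectrogram; the frame length, hop size, and sampling rate below are arbitrary example values, not parameters from the paper:

```python
import numpy as np

def stft(waveform, frame_len=512, hop=128):
    """Short-time Fourier transform: raw waveform -> complex spectrogram."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(waveform) - frame_len) // hop
    frames = np.stack([waveform[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-redundant half of the spectrum for real input
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, frame_len // 2 + 1)

# Example: one second of a 440 Hz sine at a 16 kHz sampling rate
sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)
spec = stft(wave)
print(spec.shape)  # (122, 257)
```

Enhancement models that operate on the complex spectrogram predict a cleaned version of (or a mask over) `spec`, whereas waveform-domain models work on `wave` directly.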