ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp40776.2020.9054180

Deep Audio-Visual Speech Separation with Attention Mechanism

Cited by 31 publications (38 citation statements)
References 21 publications
“…Nevertheless, the datasets collected in controlled environments are a good choice for training a prototype designed for a specific purpose or for studying a particular problem. Examples of databases useful in this sense are: TCD-TIMIT [89] and OuluVS2 [16], to study the influence of several angles of view; MODALITY [46] and OuluVS2 [16], to determine the effect of different video frame rates; Lombard GRID [13], to understand the impact of the Lombard effect, also from several angles of view; RAVDESS [161], to perform a study of emotions in the context of SE and SS; KinectDigits [224] and MODALITY [46], to determine […]. [fragment of a corpora comparison table, listing among others a corpus with 1,000 utterances of 3 seconds per speaker at 25 FPS with frontal-face view; OuluVS [286] (20 speakers); a corpus with singing at two levels of intensity; VoxCeleb2 [38] (6,112 speakers, continuous sentences, videos in the wild); and LRS2 [8] (hundreds of speakers, continuous sentences, videos in the wild with challenging visual and auditory environments), each with its citing works]…”
Section: Audio-visual Corpora
confidence: 99%
“…This means that in order to have good performance in a wide variety of settings, very large AV datasets for training and testing need to be collected. In practice, the systems are trained using a large number of complex acoustic […] [66], [76], [77], [85], [99], [122], [128], [164], [165], [176], [178], [179], [220]-[222], [244], [263], [274]. [fragment of a table of visual feature types and the works using them: landmark-based features [100], [154], [183], [203]; multisensory features [195]; face recognition embedding [55], [109], [169], [192], [239]; VSR embedding [7], [10], [107]-[109], [153], [222], [273]; facial appearance embedding [42], [208]; compressed mouth frames [37]; speaker direction [85], [244], [279]; unlabelled row: [17], [65], [154], [164], [165]]…”
Section: Audio-visual Speech Enhancement and Separation Systems
confidence: 99%
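The feature taxonomy quoted above maps naturally onto a small "visual front-end". Below is a minimal sketch, assuming PyTorch, of one common pattern from that list: an appearance-style embedding computed from mouth-region crops. The class name VisualFrontend, the layer sizes, and the 96x96 crop resolution are illustrative assumptions, not details taken from the cited works.

```python
# Minimal sketch (illustrative, not from any cited work): encoding a sequence
# of grayscale mouth-region crops into per-frame visual embeddings.
import torch
import torch.nn as nn

class VisualFrontend(nn.Module):
    """Encodes mouth crops of shape (B, T, 1, 96, 96) into (B, T, D) embeddings."""
    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1),   # 96 -> 48
            nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1),  # 48 -> 24
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),                                # (B*T, 64, 1, 1)
        )
        self.proj = nn.Linear(64, embed_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = frames.shape
        feats = self.cnn(frames.reshape(b * t, c, h, w)).flatten(1)  # (B*T, 64)
        return self.proj(feats).reshape(b, t, -1)                    # (B, T, D)

if __name__ == "__main__":
    clip = torch.randn(2, 75, 1, 96, 96)   # 3 s of video at 25 FPS
    print(VisualFrontend()(clip).shape)    # torch.Size([2, 75, 256])
```

A landmark-based variant from the same taxonomy would replace the CNN with per-frame landmark coordinates fed into the same linear projection.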
“…Although many studies have addressed this problem [5, 6], they still cannot assign the correct speaker labels to separated signals with similar vocal characteristics [8]. On the other hand, recent studies have examined audio-visual speech separation [8]-[11] and enhancement [12]-[16], which use visual information as an auxiliary input to the audio-visual speech separation model. The most common visual information used in these studies is the facial movements of the speaker.…”
Section: Introduction
confidence: 99%
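The quoted passage describes the general recipe (visual information as an auxiliary input to a separation model) rather than a specific network, so the following is a hedged sketch of attention-based audio-visual fusion in the spirit of the paper's title. The cross-attention layout, the mask-based output, and all dimensions are assumptions for illustration, not the architecture of the cited ICASSP 2020 system.

```python
# Hedged sketch of audio-visual fusion with cross-modal attention: audio
# frames attend over lip-movement embeddings to estimate a time-frequency
# mask for the target speaker. Generic illustration only.
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    def __init__(self, n_freq: int = 257, d_model: int = 256, d_vis: int = 256):
        super().__init__()
        self.audio_in = nn.Linear(n_freq, d_model)
        self.vis_in = nn.Linear(d_vis, d_model)
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)
        self.mask_out = nn.Sequential(nn.Linear(d_model, n_freq), nn.Sigmoid())

    def forward(self, mix_mag: torch.Tensor, vis_emb: torch.Tensor) -> torch.Tensor:
        """mix_mag: (B, T_a, F) mixture magnitudes; vis_emb: (B, T_v, d_vis)."""
        q = self.audio_in(mix_mag)        # audio frames as queries
        kv = self.vis_in(vis_emb)         # visual embeddings as keys/values
        fused, _ = self.cross_attn(q, kv, kv)
        mask = self.mask_out(fused)       # T-F mask in [0, 1]
        return mask * mix_mag             # estimated target magnitudes

if __name__ == "__main__":
    mix = torch.rand(2, 300, 257)          # ~3 s of STFT frames (10 ms hop)
    vis = torch.randn(2, 75, 256)          # 25 FPS visual embeddings
    print(AVSeparator()(mix, vis).shape)   # torch.Size([2, 300, 257])
```

Because the audio frames act as queries over the visual sequence, the two streams need not share a frame rate: here 300 STFT frames attend over 75 video frames.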