ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
DOI: 10.1109/icassp43922.2022.9747237

Audio-Visual Multi-Channel Speech Separation, Dereverberation and Recognition

Abstract: Accurate recognition of cocktail party speech containing overlapping speakers, noise and reverberation remains a highly challenging task to date. Motivated by the invariance of visual modality to acoustic signal corruption, an audio-visual multi-channel speech separation, dereverberation and recognition approach featuring a full incorporation of visual information into all system components is proposed in this paper. The efficacy of the video input is consistently demonstrated in mask-based MVDR speech separati…
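The mask-based MVDR front-end mentioned in the abstract can be illustrated with a minimal NumPy sketch: time-frequency masks (in the paper, predicted by an audio-visual network) weight per-frequency spatial covariance estimates, from which the MVDR filter is formed. The function names, the reference-channel choice and the diagonal-loading constant below are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of mask-based MVDR beamforming (NumPy only).
# Assumes an STFT mixture Y of shape (channels, frames, freqs) and
# speech/noise time-frequency masks of shape (frames, freqs); the masks
# would normally come from an (audio-visual) neural network.
import numpy as np

def spatial_covariance(Y, mask):
    """Mask-weighted spatial covariance: one (C, C) matrix per frequency."""
    phi = np.einsum("ctf,dtf,tf->fcd", Y, Y.conj(), mask)
    return phi / np.maximum(mask.sum(axis=0)[:, None, None], 1e-8)

def mvdr_weights(phi_s, phi_n, ref_ch=0):
    """MVDR filter w(f) = Phi_n^{-1} Phi_s u / trace(Phi_n^{-1} Phi_s)."""
    F, C, _ = phi_s.shape
    w = np.zeros((F, C), dtype=complex)
    u = np.zeros(C)
    u[ref_ch] = 1.0                       # reference microphone (assumption)
    for f in range(F):
        num = np.linalg.solve(phi_n[f] + 1e-6 * np.eye(C), phi_s[f])
        w[f] = (num / np.maximum(np.trace(num).real, 1e-8)) @ u
    return w

def apply_beamformer(Y, w):
    """Enhanced STFT: s_hat(t, f) = w(f)^H y(t, f)."""
    return np.einsum("fc,ctf->tf", w.conj(), Y)

# Example with random data standing in for a real STFT and learned masks.
rng = np.random.default_rng(0)
Y = rng.standard_normal((4, 100, 257)) + 1j * rng.standard_normal((4, 100, 257))
speech_mask = rng.uniform(size=(100, 257))
noise_mask = 1.0 - speech_mask
w = mvdr_weights(spatial_covariance(Y, speech_mask),
                 spatial_covariance(Y, noise_mask))
s_hat = apply_beamformer(Y, w)            # (frames, freqs) enhanced spectrogram
```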

Cited by 6 publications (9 citation statements). References 72 publications.
“…Li et al [22] created a novel audio-visual deep learning technique that combines auditory and visual data to detect speech from many channels. The separation filters that extract the desired speech from a mixed input of microphones and video frames are constructed by a neural network.…”
Section: Related Work
confidence: 99%
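The citation statements above and below describe a network that maps multi-channel audio and synchronized video frames to separation filters. As a rough illustration only (not the authors' architecture), the sketch below fuses per-frame audio features with a precomputed lip embedding to predict a time-frequency speech mask; the AudioVisualMaskNet name, layer sizes and concatenation-based fusion are all assumptions.

```python
# Minimal sketch of an audio-visual TF-mask estimator: audio features from a
# multi-channel STFT are fused with a per-frame visual (lip) embedding to
# predict a speech mask.  Sizes and the fusion scheme are illustrative
# assumptions, not the architecture of the cited paper.
import torch
import torch.nn as nn

class AudioVisualMaskNet(nn.Module):
    def __init__(self, n_freq=257, n_ch=4, vis_dim=512, hidden=256):
        super().__init__()
        self.audio_proj = nn.Linear(n_freq * n_ch, hidden)
        self.visual_proj = nn.Linear(vis_dim, hidden)
        # Fusion: concatenate time-aligned audio and visual embeddings.
        self.rnn = nn.LSTM(2 * hidden, hidden, num_layers=2, batch_first=True)
        self.mask_out = nn.Linear(hidden, n_freq)

    def forward(self, audio_feat, visual_feat):
        # audio_feat:  (batch, frames, n_freq * n_ch)
        # visual_feat: (batch, frames, vis_dim), upsampled to the STFT frame rate
        a = torch.relu(self.audio_proj(audio_feat))
        v = torch.relu(self.visual_proj(visual_feat))
        h, _ = self.rnn(torch.cat([a, v], dim=-1))
        return torch.sigmoid(self.mask_out(h))   # speech mask in [0, 1]

# The predicted mask would feed the spatial-covariance / MVDR step sketched above.
net = AudioVisualMaskNet()
mask = net(torch.randn(2, 100, 257 * 4), torch.randn(2, 100, 512))
print(mask.shape)   # torch.Size([2, 100, 257])
```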
“…Li et al [22] Estimates target speech separation filters from multiple microphones and video frames. A multi-task framework addresses dereverberation and voice recognition tasks.…”
Section: Makishima et al [21]
confidence: 99%
“…Li et al [14] Proposed an AV deep learning approach for multi-channel speech separation by jointly modeling audio-visual cues. It includes a neural network that estimates separation filters for target speech from multiple microphones and video frames, and a multi-task framework for dereverberation and speech recognition.…”
Section: Recent AVSS Work
confidence: 99%
“…One advanced method for audio-visual source separation involves the use of deep learning techniques [10-12, 14, 22, 34, 37, 38].…”
Section: Introduction
confidence: 99%
“…In recent years, end-to-end DNN-based microphone array beamforming techniques represented by a) neural timefrequency (TF) masking approaches [7]; b) neural Filter and Sum methods [8,9]; and c) mask-based MVDR [10] and generalized eigenvalues (GEV) [11] approaches have been widely adopted. In addition, incorporating visual information into either multi speech separation front-ends alone [12], or further into speech recognition back-ends [13], can further improve the overall system performance.…”
Section: Introduction
confidence: 99%
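The statement above also mentions generalized eigenvalue (GEV) beamforming alongside mask-based MVDR. A minimal sketch under the same assumptions as the MVDR example earlier (per-frequency speech/noise covariances phi_s and phi_n): take the principal generalized eigenvector per frequency bin. The helper name and regularization constant are assumptions, not code from the cited papers.

```python
# Minimal sketch of a GEV (generalized eigenvalue) beamformer, as a
# counterpart to the mask-based MVDR sketch above.  phi_s and phi_n are
# per-frequency speech/noise spatial covariance matrices, e.g. from the
# spatial_covariance() helper sketched earlier.
import numpy as np
from scipy.linalg import eigh

def gev_weights(phi_s, phi_n):
    """Per-frequency filter maximizing output SNR  w^H Phi_s w / w^H Phi_n w."""
    F, C, _ = phi_s.shape
    w = np.zeros((F, C), dtype=complex)
    for f in range(F):
        # Principal generalized eigenvector of the pencil (Phi_s, Phi_n).
        _, vecs = eigh(phi_s[f], phi_n[f] + 1e-6 * np.eye(C))
        w[f] = vecs[:, -1]        # eigenvector of the largest eigenvalue
    return w

# In practice a post-filter (e.g. blind analytic normalization) is commonly
# applied after GEV beamforming to reduce speech distortion.
```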