ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2021
DOI: 10.1109/icassp39728.2021.9413491

Audio-Visual Speech Separation Using Cross-Modal Correspondence Loss

Abstract: We present an audio-visual speech separation learning method that considers the correspondence between the separated signals and the visual signals to reflect the speech characteristics during training. Audio-visual speech separation is a technique to estimate the individual speech signals from a mixture using the visual signals of the speakers. Conventional studies on audio-visual speech separation mainly train the separation model on the audio-only loss, which reflects the distance between the source signals…

Cited by 7 publications (5 citation statements)
References: 31 publications
“…Makishima et al [21] proposed a model with two subnetworks for visual and auditory inputs. Embeddings are produced through different sub-networks, which are subsequently integrated and processed by a decoder network.…”
Section: Related Work (mentioning)
confidence: 99%
“…Makishima et al [21] proposed a model with two subnetworks for visual and auditory inputs. Embeddings are produced through different sub-networks, which are subsequently integrated and processed by a decoder network.…”
Section: Related Work (mentioning)
confidence: 99%
“…Makishima et al [38] proposed a deep learning method that utilizes both auditory and visual information to separate speech signals from an audio-visual mixture. It consists of two subnetworks, a visual network and an auditory network, which generate embeddings that are concatenated and passed through a decoder network.…”
Section: Recent AVSS Work (mentioning)
confidence: 99%
“…One advanced method for audio-visual source separation involves the use of deep learning techniques [10][11][12]14,22,34,[37][38].…”
Section: Introduction (mentioning)
confidence: 99%
“…Most recent algorithms made use of lips motion as well as appearance information, usually implementing cross-modal losses to pull together corresponding audio-visual features Gao and Grauman [2021], Makishima et al [2021].…”
Section: Related Work (mentioning)
confidence: 99%