2022
DOI: 10.48550/arxiv.2203.04099
Preprint
VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Abstract: This paper presents an audio-visual approach for voice separation which outperforms state-of-the-art methods at a low latency in two scenarios: speech and singing voice. The model is based on a two-stage network. Motion cues are obtained with a lightweight graph convolutional network that processes face landmarks. Then, both audio and motion features are fed to an audio-visual transformer which produces a fairly good estimation of the isolated target source. In a second stage, the predominant voice is enhanced …
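The abstract's two-stage pipeline (graph conv on face landmarks for motion cues, audio-visual fusion to estimate the target, then an enhancement stage) can be sketched roughly as follows. This is a minimal illustrative sketch: all shapes, layer sizes, random weights, and the single-head attention stand-in for the audio-visual transformer are assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
T, L, D, F = 50, 68, 16, 32  # frames, landmarks, feature dim, freq bins (assumed)

def graph_conv(x, adj, w):
    """One graph-convolution step over face landmarks:
    mean-aggregate neighbours, project, ReLU."""
    deg = adj.sum(axis=1, keepdims=True)
    return np.maximum((adj / deg) @ x @ w, 0.0)

def attention(q, k, v):
    """Scaled dot-product attention (stand-in for the AV transformer)."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    s -= s.max(axis=-1, keepdims=True)
    w = np.exp(s)
    return (w / w.sum(axis=-1, keepdims=True)) @ v

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Motion cues: graph conv over 2-D landmark coordinates, pooled per frame.
landmarks = rng.standard_normal((T, L, 2))
adj = (rng.random((L, L)) < 0.1).astype(float)
np.fill_diagonal(adj, 1.0)                    # self-loops
w_gc = rng.standard_normal((2, D))
motion = np.stack([graph_conv(landmarks[t], adj, w_gc).mean(axis=0)
                   for t in range(T)])        # (T, D)

# Stage 1: fuse audio and motion features, mask the mixture spectrogram.
audio = rng.standard_normal((T, D))           # per-frame audio features
mix_spec = np.abs(rng.standard_normal((T, F)))
fused = attention(audio, motion, motion)      # audio queries attend to motion
mask1 = sigmoid(fused @ rng.standard_normal((D, F)))
est1 = mask1 * mix_spec                       # rough isolated-voice estimate

# Stage 2: enhance the predominant voice with a refinement mask.
mask2 = sigmoid(est1 @ rng.standard_normal((F, F)))
est2 = mask2 * est1
print(est2.shape)                             # (50, 32)
```

The sketch only shows the data flow: landmarks feed a lightweight graph encoder, its output conditions the audio path, and a second masking stage refines the first estimate.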

Cited by 3 publications (3 citation statements)
References 26 publications
“…The other bottleneck is automated speech recognition and natural language understanding in case of background noise and other speakers. This technology is evolving quickly, see, e.g., [58] and the recent publications [59,60], but further improvements are needed.…”
Section: Discussion
confidence: 99%
“…Transformers have emerged as powerful deep learning architectures capable of capturing long-range dependencies in time series. Lately, transformers have been explored for several audio-visual tasks such as source separation [16,17], source localisation [18] and speech recognition [19], including synchronisation [13]. Our work in this paper is closest to Audio-Visual Synchronisation with Transformers (AVST) [13].…”
Section: Related Work
confidence: 99%