VoViT: Low Latency Graph-based Audio-Visual Voice Separation Transformer

Montesinos, Juan F.; Kadandale, Venkatesh S.; Haro, Gloria

doi:10.48550/arxiv.2203.04099

Cited by 3 publications

(3 citation statements)

References 26 publications

“…The other bottleneck is automated speech recognition and natural language understanding in case of background noise and other speakers. This technology is evolving quickly, see, e.g., [58] and the recent publications [59,60], but further improvements are needed.…”

Section: Discussion (mentioning)

confidence: 99%

AI Technologies for Machine Supervision and Help in a Rehabilitation Scenario

Baranyi, Melício, Gaál, et al. 2022. MTI

We consider, evaluate, and develop methods for home rehabilitation scenarios. We show the required modules for this scenario. Due to the large number of modules, the framework falls into the category of Composite AI. Our work is based on collected videos with high-quality execution and samples of typical errors. They are augmented by sample dialogues about the exercise to be executed and the assumed errors. We study and discuss body pose estimation technology, dialogue systems of different kinds and the emerging constraints of verbal communication. We demonstrate that the optimization of the camera and the body pose allows high-precision recording and requires the following components: (1) optimization needs a 3D representation of the environment, (2) a navigation dialogue to guide the patient to the optimal pose, (3) semantic and instance maps are necessary for verbal instructions about the navigation. We put forth different communication methods, from video-based presentation to chit-chat-like dialogues through rule-based methods. We discuss the methods for different aspects of the challenges that can improve the performance of the individual components. Due to the emerging solutions, we claim that the range of applications will drastically grow in the very near future.

“…Transformers have emerged as powerful deep learning architectures capable of capturing long range dependencies in time series. Lately, transformers have been explored for several audio-visual tasks such as source separation [16,17], source localisation [18] and speech recognition [19], including synchronisation [13]. Our work in this paper is closest to Audio-Visual Synchronisation with Transformers (AVST) [13].…”

Section: Related Work (mentioning)

confidence: 99%

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Kadandale, Montesinos, Haro. 2022. Preprint

Self Cite

In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audiovisual correspondence score. We propose an audio-visual crossmodal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on the lip synchronisation in speech videos, we also consider the special case of singing voice. Singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code and the pre-trained model will be made available on https://ipcv.github.io/VocaLiST/

“…Transformers have emerged as powerful deep learning architectures capable of capturing long-range dependencies in time series. Lately, transformers have been explored for several AV tasks such as source separation [16,17], source localisation [18] and speech recognition [19], including synchronisation [13]. Our work in this paper is closest to Audio-Visual Synchronisation with Transformers (AVST) [13].…”

Section: Related Work (mentioning)

confidence: 99%

VocaLiST: An Audio-Visual Synchronisation Model for Lips and Voices

Kadandale, Montesinos, Haro. 2022. Interspeech 2022

In this paper, we address the problem of lip-voice synchronisation in videos containing human face and voice. Our approach is based on determining if the lips motion and the voice in a video are synchronised or not, depending on their audiovisual correspondence score. We propose an audio-visual crossmodal transformer-based model that outperforms several baseline models in the audio-visual synchronisation task on the standard lip-reading speech benchmark dataset LRS2. While the existing methods focus mainly on lip synchronisation in speech videos, we also consider the special case of the singing voice. The singing voice is a more challenging use case for synchronisation due to sustained vowel sounds. We also investigate the relevance of lip synchronisation models trained on speech datasets in the context of singing voice. Finally, we use the frozen visual features learned by our lip synchronisation model in the singing voice separation task to outperform a baseline audio-visual model which was trained end-to-end. The demos, source code, and the pre-trained models are available on https://ipcv.github.io/VocaLiST/
