“…We build on the automatic dubbing architecture presented in [7,8]. Figure 1 shows (in bold) how we extend a speech-to-speech translation [1,2,3] pipeline with: neural machine translation (MT) robust to ASR errors and able to control the verbosity of its output [11,13,14]; prosodic alignment (PA) [6,8,9], which addresses phrase-level synchronization of the MT output by leveraging the force-aligned source transcript; neural text-to-speech (TTS) [15,16,17] with precise duration control; and, finally, audio rendering that enriches the TTS output with the original background noise (extracted via audio source separation with deep U-Nets [18,19]) and with reverberation estimated from the original audio [20,21].…”
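The excerpt describes a staged pipeline, so a structural sketch may help make the data flow concrete: ASR with forced alignment produces timed source phrases, verbosity-controlled MT translates them, prosodic alignment maps the translation back onto the source timings, duration-controlled TTS synthesizes each phrase, and audio rendering mixes in the separated background. The sketch below is a minimal Python illustration of that flow; every class and function name (`Utterance`, `asr_force_align`, `mt_verbosity_controlled`, and so on) is a hypothetical placeholder, not the authors' implementation, and the stub bodies stand in for the actual neural models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    text: str     # transcript or translation of one phrase
    start: float  # onset in the original audio (seconds)
    end: float    # offset in the original audio (seconds)

# --- Stubs standing in for the real models; all names are hypothetical ---

def asr_force_align(audio: List[float]) -> List[Utterance]:
    """ASR plus forced alignment: timed phrases of the source transcript."""
    return [Utterance("hello world", 0.0, 1.2)]

def mt_verbosity_controlled(u: Utterance) -> str:
    """MT robust to ASR errors; verbosity control keeps the translation
    roughly as long as the source phrase so it can fit its time slot."""
    return "hallo Welt"

def prosodic_align(translation: str, u: Utterance) -> List[Utterance]:
    """Segment the translation into phrases synchronized with the
    force-aligned source timings (here trivially one phrase)."""
    return [Utterance(translation, u.start, u.end)]

def tts_with_duration(phrase: Utterance, sr: int = 16000) -> List[float]:
    """Neural TTS constrained to exactly fill the target duration."""
    return [0.0] * int((phrase.end - phrase.start) * sr)

def separate_background(audio: List[float]) -> List[float]:
    """Source separation (e.g. a deep U-Net) extracting background noise."""
    return [0.0] * len(audio)

def render(speech: List[float], background: List[float]) -> List[float]:
    """Audio rendering: mix synthetic speech with the separated background
    (estimated reverberation would also be applied here; not shown)."""
    return [s + b for s, b in zip(speech, background)]

def dub(audio: List[float]) -> List[float]:
    """One end-to-end dubbing pass over a single audio track."""
    background = separate_background(audio)
    dubbed_speech: List[float] = []
    for u in asr_force_align(audio):
        translation = mt_verbosity_controlled(u)
        for phrase in prosodic_align(translation, u):
            dubbed_speech.extend(tts_with_duration(phrase))
    return render(dubbed_speech, background)
```

Note how the design keeps synchronization concerns local: verbosity control constrains length at translation time, prosodic alignment fixes phrase boundaries, and TTS duration control enforces the final timing, so no single stage has to solve the whole lip-sync problem.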