“…We build on the automatic dubbing architecture presented in [7,8]. Figure 1 shows (in bold) how we extend a speech-to-speech translation [1,2,3] pipeline with: neural machine translation (MT) robust to ASR errors and able to control the verbosity of its output [11,13,14]; prosodic alignment (PA) [6,8,9], which addresses phrase-level synchronization of the MT output by leveraging the force-aligned source transcript; neural text-to-speech (TTS) [15,16,17] with precise duration control; and, finally, audio rendering that enriches the TTS output with the original background noise (extracted via audio source separation with deep U-Nets [18,19]) and with reverberation estimated from the original audio [20,21].…”
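The excerpt describes a staged pipeline, so a structural sketch may help make the data flow concrete: ASR with forced alignment produces timed source phrases, verbosity-controlled MT translates them, prosodic alignment maps the translation back onto the source timings, duration-controlled TTS synthesizes each phrase, and audio rendering mixes in the separated background. The sketch below is a minimal Python illustration of that flow; every class and function name (`Utterance`, `asr_force_align`, `mt_verbosity_controlled`, and so on) is a hypothetical placeholder, not the authors' implementation, and the stub bodies stand in for the actual neural models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Utterance:
    text: str     # transcript or translation of one phrase
    start: float  # onset in the original audio (seconds)
    end: float    # offset in the original audio (seconds)

# --- Stubs standing in for the real models; all names are hypothetical ---

def asr_force_align(audio: List[float]) -> List[Utterance]:
    """ASR plus forced alignment: timed phrases of the source transcript."""
    return [Utterance("hello world", 0.0, 1.2)]

def mt_verbosity_controlled(u: Utterance) -> str:
    """MT robust to ASR errors; verbosity control keeps the translation
    roughly as long as the source phrase so it can fit its time slot."""
    return "hallo Welt"

def prosodic_align(translation: str, u: Utterance) -> List[Utterance]:
    """Segment the translation into phrases synchronized with the
    force-aligned source timings (here trivially one phrase)."""
    return [Utterance(translation, u.start, u.end)]

def tts_with_duration(phrase: Utterance, sr: int = 16000) -> List[float]:
    """Neural TTS constrained to exactly fill the target duration."""
    return [0.0] * int((phrase.end - phrase.start) * sr)

def separate_background(audio: List[float]) -> List[float]:
    """Source separation (e.g. a deep U-Net) extracting background noise."""
    return [0.0] * len(audio)

def render(speech: List[float], background: List[float]) -> List[float]:
    """Audio rendering: mix synthetic speech with the separated background
    (estimated reverberation would also be applied here; not shown)."""
    return [s + b for s, b in zip(speech, background)]

def dub(audio: List[float]) -> List[float]:
    """One end-to-end dubbing pass over a single audio track."""
    background = separate_background(audio)
    dubbed_speech: List[float] = []
    for u in asr_force_align(audio):
        translation = mt_verbosity_controlled(u)
        for phrase in prosodic_align(translation, u):
            dubbed_speech.extend(tts_with_duration(phrase))
    return render(dubbed_speech, background)
```

Note how the design keeps synchronization concerns local: verbosity control constrains length at translation time, prosodic alignment fixes phrase boundaries, and TTS duration control enforces the final timing, so no single stage has to solve the whole lip-sync problem.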