Proceedings of the 17th International Conference on Spoken Language Translation 2020
DOI: 10.18653/v1/2020.iwslt-1.31
From Speech-to-Speech Translation to Automatic Dubbing

Abstract: We present enhancements to a speech-to-speech translation pipeline in order to perform automatic dubbing. Our architecture features neural machine translation generating output of preferred length, prosodic alignment of the translation with the original speech segments, neural text-to-speech with fine-tuning of the duration of each utterance, and, finally, audio rendering that enriches the text-to-speech output with background noise and reverberation extracted from the original audio. We report and discuss results of…
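The duration fine-tuning step mentioned in the abstract can be illustrated with a toy speaking-rate computation; the function, its name, and the clipping threshold are assumptions for illustration, not the paper's implementation:

```python
def rate_factor(synth_dur, slot_dur, max_stretch=1.3):
    """Speaking-rate multiplier that fits synthesized speech of
    synth_dur seconds into a slot of slot_dur seconds, clipped so the
    result never exceeds max_stretch times faster or slower speech."""
    ratio = synth_dur / slot_dur
    return max(1.0 / max_stretch, min(max_stretch, ratio))

# A 2.6 s synthesis squeezed into a 2.0 s slot: speed up by 1.3x (the cap).
print(rate_factor(2.6, 2.0))  # → 1.3
```

Clipping matters in practice: an unconstrained rate change would fit any slot exactly but can make speech sound unnaturally rushed or dragged.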

Cited by 21 publications (40 citation statements)
References 27 publications
“…This mode allows access to future context, and imposes no strict computational restrictions. Typical applications include movie subtitling (Matusov et al., 2019) and dubbing (Saboo and Baumann, 2019; Federico et al., 2020).…”
Section: Mode Of Delivery
confidence: 99%
“…A major requirement of dubbing is speech synchronization which, in order of priority, should happen at the utterance level (isochrony), lip movement level (lip synchrony), and body movement level (kinetic synchrony) [5]. Most of the work on AD [6,7,8], including this one, addresses isochrony, which aims to generate translations and utterances that match the phrase-pause arrangement of the original audio. Given a source sentence transcript, the first step is to generate a translation of more or less the same "duration" [9,10], e.g.…”
Section: Introduction
confidence: 99%
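The idea of generating a translation of "more or less the same duration" can be sketched by reranking translation hypotheses with a length proxy. Everything below is a crude illustration under stated assumptions: the vowel-group syllable counter is a rough proxy, and the function names and example sentences are invented, not from the cited papers:

```python
import re

def syllable_count(text):
    # Crude syllable proxy: count maximal vowel groups (illustration only;
    # real systems would use a proper syllabifier or phoneme durations).
    return max(1, len(re.findall(r"[aeiouy]+", text.lower())))

def pick_isochronous(candidates, source_text):
    """Pick the translation hypothesis whose length proxy is closest to
    the source's, approximating equal speaking duration (isochrony)."""
    target = syllable_count(source_text)
    return min(candidates, key=lambda c: abs(syllable_count(c) - target))

hyps = ["I will see you tomorrow morning",
        "See you tomorrow",
        "I shall be seeing you in the morning"]
print(pick_isochronous(hyps, "nos vemos manana por la manana"))
# → I will see you tomorrow morning
```

Character counts (as in Federico et al., 2020) or syllable counts can both serve as the proxy; neither perfectly reflects spoken duration, which is exactly the limitation the citing paper points out.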
“…number of characters or syllables. The second step, called prosodic alignment (PA) [6,7,8], segments the translation into phrases and pauses of the same duration as the original phrases.…”
Section: Introduction
confidence: 99%
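A greedy stand-in for that segmentation step might place phrase boundaries where cumulative character counts best match the cumulative duration shares of the original phrases. This is an illustrative sketch, not the algorithm of the cited papers, and character count is again only a proxy for duration:

```python
def prosodic_align(words, durations):
    """Split a translated word sequence into len(durations) phrases whose
    character lengths are roughly proportional to the original phrase
    durations (greedy illustration of prosodic alignment)."""
    total_dur = sum(durations)
    total_chars = sum(len(w) for w in words)
    # Cumulative character count after each word.
    cum, c = [], 0
    for w in words:
        c += len(w)
        cum.append(c)
    bounds = []
    for k in range(len(durations) - 1):
        target = total_chars * sum(durations[:k + 1]) / total_dur
        # Boundary after the word whose cumulative count is nearest target.
        j = min(range(len(words) - 1), key=lambda i: abs(cum[i] - target))
        bounds.append(j + 1)
    bounds = sorted(set(bounds)) + [len(words)]
    phrases, start = [], 0
    for b in bounds:
        phrases.append(" ".join(words[start:b]))
        start = b
    return phrases

words = "we present a new method for automatic dubbing".split()
print(prosodic_align(words, [1.2, 0.8]))
# → ['we present a new method for', 'automatic dubbing']
```

The resulting phrases can then be synthesized separately and placed into the pause structure of the original audio.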
“…Öktem et al. (2019) use the NMT attention mechanism to segment the translation into prosodic phrases in order to improve a Text-to-Speech system for dubbing. Federico et al. (2020) adapt an NMT system to generate translations of the same length as the source, although in terms of characters, which does not necessarily reflect the duration of the utterance. However, none of these works has taken into consideration the on/off-screen dichotomy.…”
Section: Introduction
confidence: 99%