2022
DOI: 10.48550/arxiv.2204.02530
Preprint

Prosodic Alignment for off-screen automatic dubbing

Abstract: The goal of automatic dubbing is to perform speech-to-speech translation while achieving audiovisual coherence. This entails isochrony, i.e., translating the original speech by also matching its prosodic structure into phrases and pauses, especially when the speaker's mouth is visible. In previous work, we introduced a prosodic alignment model to address isochrone or on-screen dubbing. In this work, we extend the prosodic alignment model to also address off-screen dubbing that requires less stringent synchroni…
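
The isochrony requirement described in the abstract can be pictured as a check that the dubbed speech reproduces the original phrase-and-pause structure. The sketch below is a hypothetical illustration only, not the paper's model: the segment representation and the relative tolerance are assumptions made for the example.

```python
# Hypothetical isochrony check: segment format and tolerance are assumed
# for illustration; this is not the prosodic alignment model of the paper.

from dataclasses import dataclass
from typing import List


@dataclass
class Segment:
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def isochrony_score(source_phrases: List[Segment],
                    target_phrases: List[Segment],
                    tolerance: float = 0.3) -> float:
    """Fraction of phrases whose dubbed duration stays within a relative
    tolerance of the original phrase duration (1.0 = perfectly isochronous)."""
    if len(source_phrases) != len(target_phrases):
        raise ValueError("phrase segmentations must be aligned one-to-one")
    ok = 0
    for src, tgt in zip(source_phrases, target_phrases):
        rel_dev = abs(tgt.duration - src.duration) / max(src.duration, 1e-6)
        ok += rel_dev <= tolerance
    return ok / max(len(source_phrases), 1)


if __name__ == "__main__":
    src = [Segment(0.0, 1.8), Segment(2.3, 4.0)]  # original phrases, pause in between
    tgt = [Segment(0.0, 1.9), Segment(2.3, 4.6)]  # dubbed phrases
    print(f"isochrony score: {isochrony_score(src, tgt):.2f}")
```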

Cited by 1 publication (1 citation statement)
References 18 publications
“…Existing video dubbing works (Öktem, Farrús, and Bonafonte 2019; Federico et al 2020b; Lakew et al 2021; Virkar et al 2021; Sharma et al 2021; Effendi et al 2022; Lakew et al 2022; Virkar et al 2022; Tam et al 2022) are usually based on a cascaded speech-to-speech translation system (Federico et al 2020a) with ad-hoc designs, mainly concentrating on the Neural Machine Translation (NMT) and Text-To-Speech (TTS) stages. In the NMT stage, related works achieve the length control by assuming that similar number of words/characters should have similar speech length, and therefore encourage a model to generate target sequence with similar number of words/characters to the source sequence (Federico et al 2020a).…”
Section: Introduction
Mentioning confidence: 99%
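
The length-control heuristic this citation describes, steering the translation toward roughly the same number of characters as the source, can be illustrated by a simple re-ranking pass over NMT hypotheses. This is only a sketch of the general idea, not the procedure of Federico et al (2020a); the penalty weight and the candidate list are placeholders.

```python
# Hypothetical character-count length control via hypothesis re-ranking;
# the weight `alpha` and the candidates are illustrative placeholders.

from typing import List, Tuple


def length_penalty(source: str, hypothesis: str) -> float:
    """Relative deviation of the hypothesis character count from the source's.
    0.0 means the lengths match exactly; larger values mean a worse fit."""
    return abs(len(hypothesis) - len(source)) / max(len(source), 1)


def rerank_by_length(source: str,
                     hypotheses: List[Tuple[str, float]],
                     alpha: float = 1.0) -> List[Tuple[str, float]]:
    """Combine each (text, log_prob) hypothesis score with a length-compliance
    penalty and return the candidates best-first."""
    scored = [(text, log_prob - alpha * length_penalty(source, text))
              for text, log_prob in hypotheses]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    source = "The weather is lovely today."
    candidates = [
        ("Il tempo oggi è davvero molto bello e piacevole.", -0.9),
        ("Oggi il tempo è bello.", -1.1),
    ]
    for text, score in rerank_by_length(source, candidates):
        print(f"{score:+.2f}  {text}")
```

In this toy example the shorter Italian candidate wins after re-ranking because its character count is much closer to the source's, which is the behaviour the cited length-control assumption aims for.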