11th ISCA Speech Synthesis Workshop (SSW 11) 2021
DOI: 10.21437/SSW.2021-10

Adaptation of Tacotron2-based Text-To-Speech for Articulatory-to-Acoustic Mapping using Ultrasound Tongue Imaging

Cited by 3 publications (3 citation statements)
References 0 publications

“…While this simple arrangement already performs reasonably well, significant improvement can be achieved by involving the input context, that is, by using a block of video frames as input instead of just one image. Several network architectures have been proposed to process 3D blocks of input data, for video processing in general [24,25,26], and for ultrasound input in particular [12,14,27,16,28]. In the experimental section we will experiment with both 2D and 3D Convolutional Neural Networks (CNNs) for the mapping task.…”
Section: The UTI-to-Speech Framework (citation type: mentioning)
confidence: 99%
“…Ideally, these interfaces would record the articulation and synthesize speech based on the movement of the organs, without the user of the device actually producing any sound. The typical input of AAM can be a video of the lip movements [3,4,5,6,7,8], ultrasound tongue imaging (UTI) [3,9,10,11,12,13,14,15,16,17], or several other modalities (e.g., MRI, EMA, PMA, EOS, radar, multimodal, etc.). All of the articulatory tracking devices are highly sensitive to 1) the alignment of the recording equipment across sessions, and 2) the actual speaker's anatomy.…”
Section: Introduction (citation type: mentioning)
confidence: 99%
“…While this simple arrangement already performs reasonably well, significant improvement can be achieved by involving the input context, that is, by using a block of video frames as input instead of just one image. Several network architectures have been proposed to process 3D blocks of input data, for video processing in general [24,48,104], and for ultrasound input in particular [53,86,102,113]. In the experimental section we will experiment with both 2D and 3D Convolutional Neural Networks (CNNs) for the mapping task.…”
Section: The UTI-to-Speech Framework (citation type: mentioning)
confidence: 99%
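
The 2D-versus-3D CNN distinction that the excerpts above describe is easy to make concrete: a 2D network maps one ultrasound frame to one acoustic feature vector, while a 3D network convolves over a short block of consecutive frames so the filters can see articulator motion. The following is a minimal PyTorch sketch of that idea only; the frame size (64x128), context length (5), target dimension (80), and layer sizes are all hypothetical and are not taken from the cited papers.

    import torch
    import torch.nn as nn

    # Hypothetical dimensions: 64x128-pixel ultrasound frames, a 5-frame
    # context block, and an 80-dim acoustic target per frame (e.g. one
    # mel-spectrogram frame).
    FRAME_H, FRAME_W, CONTEXT, N_ACOUSTIC = 64, 128, 5, 80

    # 2D variant: a single ultrasound frame -> one acoustic feature vector.
    cnn_2d = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # (1,64,128) -> (16,32,64)
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # -> (32,16,32)
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * 16 * 32, N_ACOUSTIC),
    )

    # 3D variant: a block of consecutive frames -> one acoustic feature
    # vector. The extra depth dimension is the time axis, so the kernels
    # respond to movement across frames, not just static tongue shapes.
    cnn_3d = nn.Sequential(
        nn.Conv3d(1, 16, kernel_size=3, stride=(1, 2, 2), padding=1),   # (1,5,64,128) -> (16,5,32,64)
        nn.ReLU(),
        nn.Conv3d(16, 32, kernel_size=3, stride=(1, 2, 2), padding=1),  # -> (32,5,16,32)
        nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * CONTEXT * 16 * 32, N_ACOUSTIC),
    )

    single_frame = torch.randn(8, 1, FRAME_H, FRAME_W)          # batch of 8 frames
    frame_block = torch.randn(8, 1, CONTEXT, FRAME_H, FRAME_W)  # batch of 8 blocks
    print(cnn_2d(single_frame).shape)  # torch.Size([8, 80])
    print(cnn_3d(frame_block).shape)   # torch.Size([8, 80])

Both variants emit one acoustic vector per (central) frame; the 3D model simply pays for its temporal context with more parameters and compute, which is the trade-off the citing papers evaluate experimentally.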