2021
DOI: 10.48550/arxiv.2106.11769
Preprint

Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism

Haiyang Liu,
Jihan Zhang

Abstract: Speech production is a dynamic process that involves multiple human organs, including the tongue, jaw, and lips. Modeling the dynamics of vocal tract deformation is a fundamental problem in understanding speech, the most common means of everyday human communication. Researchers use several sensory streams simultaneously to describe the process, and these streams are incontrovertibly statistically related to one another. In this paper, we address the following question: given observable image sequences of …

citations
Cited by 1 publication
(2 citation statements)
references
References 30 publications
“…These transformations lead to a vector-based connection notation for cascaded features between $x^l$ and $g$ within the intermediary space of $\mathbb{R}^{F_{int}}$. The AG's output merges input elements with attention coefficients through elementwise multiplication, as formulaically represented in Equation (3). In this study, the AG calculates a singular scalar focus value for each pixel vector $x_i^l \in \mathbb{R}^{F_l}$, with $F_l$ indicating the number of feature maps at layer $l$.…”
Section: Gated Attention (mentioning)
confidence: 99%
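
The attention gate (AG) described in the statement above can be sketched roughly as follows. This is a minimal illustration, not the citing paper's implementation: Equation (3) is not reproduced on this page, so the sketch assumes the common additive attention-gate form, in which the pixel vectors and the gating signal are projected into a shared intermediate space of size F_int, combined, and reduced to one scalar coefficient per pixel. The function name `attention_gate`, the weight names `W_x`, `W_g`, `psi`, and all array sizes are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def attention_gate(x, g, W_x, W_g, psi, b_g=0.0, b_psi=0.0):
    """Additive attention gate over flattened pixels (assumed form).

    x:   (N, F_l)  pixel vectors x_i^l, one row per pixel
    g:   (N, F_g)  gating vectors, already resampled to the same N pixels
    W_x: (F_l, F_int), W_g: (F_g, F_int)  projections into the intermediate space
    psi: (F_int,)  projection from the intermediate space to a scalar
    """
    # Project both inputs into R^{F_int}, add them, and apply ReLU.
    q = np.maximum(x @ W_x + g @ W_g + b_g, 0.0)   # (N, F_int)
    # One scalar attention coefficient in (0, 1) per pixel.
    alpha = sigmoid(q @ psi + b_psi)               # (N,)
    # Elementwise multiplication: scale every feature of a pixel by its coefficient.
    return x * alpha[:, None], alpha

# Toy sizes and random weights, chosen only for illustration.
rng = np.random.default_rng(0)
N, F_l, F_g, F_int = 6, 4, 5, 3
x = rng.normal(size=(N, F_l))
g = rng.normal(size=(N, F_g))
W_x = rng.normal(size=(F_l, F_int))
W_g = rng.normal(size=(F_g, F_int))
psi = rng.normal(size=F_int)

gated, alpha = attention_gate(x, g, W_x, W_g, psi)
print(alpha.shape, gated.shape)
```

Each row of `gated` is the corresponding pixel vector scaled by a single coefficient, which matches the "singular scalar focus value for each pixel vector" in the quoted text; a real network would learn the projections and operate on image-shaped tensors rather than flattened rows.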
“…Research indicates that tongue contours serve as an invaluable foundation for the quantitative analysis of speech, with data obtained from these contours facilitating the advancement and comprehension of speech models [3, 4]. Ultrasonic tongue contour extraction can dynamically capture the tongue’s position across various phonetic expressions and depict the movements responsible for sound transitions during articulation [5].…”
Section: Introduction (mentioning)
confidence: 99%