2021
DOI: 10.48550/arxiv.2104.05017
Preprint

Estimating articulatory movements in speech production with transformer networks

Cited by 3 publications (5 citation statements) | References 0 publications

“…1, we use a transformer architecture to represent the input phoneme sequence. Transformers are an ideal choice since they have been shown to perform well in learning text features, especially in multi-modal scenarios [15,16]. We then use a custom convolutional decoder consisting of 2D and 3D convolutional neural networks (CNNs) to represent the frames.…”
Section: Proposed Methodology
Mentioning confidence: 99%
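
The excerpt describes a transformer encoder over the phoneme sequence feeding a convolutional decoder that produces frames. The PyTorch sketch below illustrates that general pattern only; it is not the citing paper's implementation, and the class name, layer sizes, and the 64x64 frame resolution are all assumptions. For brevity it uses 2D transposed convolutions alone, whereas the paper's decoder also includes 3D CNNs.

```python
import torch
import torch.nn as nn

class PhonemeToFrames(nn.Module):
    """Hypothetical encoder-decoder: transformer over phonemes -> one frame each."""
    def __init__(self, n_phonemes=64, d_model=256, n_heads=4, n_layers=4):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        # Reshape each 256-dim phoneme feature into a 16x16 map, then upsample
        # to a 64x64 frame (each stride-2 transposed conv doubles the size).
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(1, 32, kernel_size=4, stride=2, padding=1),
            nn.ReLU(),
            nn.ConvTranspose2d(32, 1, kernel_size=4, stride=2, padding=1),
        )

    def forward(self, phoneme_ids):                    # (batch, seq)
        feats = self.encoder(self.embed(phoneme_ids))  # (batch, seq, 256)
        b, t, d = feats.shape
        frames = self.decoder(feats.reshape(b * t, 1, 16, 16))
        return frames.reshape(b, t, 64, 64)            # one frame per phoneme

model = PhonemeToFrames()
out = model(torch.randint(0, 64, (2, 10)))             # -> (2, 10, 64, 64)
```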
“…Deep learning methods are appropriate for this task since the models can learn across long sequences of arbitrarily different lengths, can learn between modalities, and generalise well to unseen situations. In recent years, transformer neural networks have been shown to perform well on various speech-based sequence-to-sequence tasks such as speech synthesis [15] and acoustic-to-articulatory inversion [16]. The self-attention operation in transformer networks enables learning the dependencies between positions in a sequence, making it suitable for learning good phoneme-level features.…”
Section: Introduction
Mentioning confidence: 99%
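
The self-attention claim in this excerpt can be made concrete: every output position is a weighted mix of all input positions, with the weights computed from pairwise query-key scores. Below is a minimal single-head scaled dot-product sketch in PyTorch; the dimensions and variable names are illustrative, not from the cited work.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (seq_len, d_model); w_q/w_k/w_v: (d_model, d_k) projections."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / k.shape[-1] ** 0.5  # (seq_len, seq_len) pairwise scores
    weights = F.softmax(scores, dim=-1)      # each row sums to 1
    return weights @ v                       # each output mixes all positions

d_model, d_k = 256, 64
x = torch.randn(10, d_model)                 # e.g. 10 phoneme embeddings
w_q, w_k, w_v = (torch.randn(d_model, d_k) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)       # (10, 64)
```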
“…Details in Sections 2.1 and 3.1. 20,21,22]. Since these methods are data-driven and depend on the limited amount of articulatory data, to our knowledge they are unable to sufficiently generalize to unseen speakers.…”
Section: Introduction
Mentioning confidence: 99%