2020
DOI: 10.1007/978-3-030-61401-0_16

3D Convolutional Neural Networks for Ultrasound-Based Silent Speech Interfaces

Abstract: Silent speech interfaces (SSI) aim to reconstruct the speech signal from a recording of the articulatory movement, such as an ultrasound video of the tongue. Currently, deep neural networks are the most successful technology for this task. The efficient solution requires methods that do not simply process single images, but are able to extract the tongue movement information from a sequence of video frames. One option for this is to apply recurrent neural structures such as the long short-term memory network (…

Cited by 8 publications (20 citation statements)
References 32 publications
“…This parameter allows us to analyze input blocks that are placed at bigger time intervals. In an earlier study we processed ultrasound videos which had a much larger frame rate, and the optimal value for sts was found to be 5 [32]. Here, we got the best performance with sts = 3, which is reasonable as the frame rate was much lower.…”
Section: D-cnn+bilstm (supporting)
confidence: 55%
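The role of a time stride such as sts can be illustrated with a minimal sketch. It assumes one possible reading of the statement above: that sts is the spacing, in frames, between the frames fed to the network as one input block. The function name `strided_block_indices` and its parameters are hypothetical and do not come from the cited papers.

```python
def strided_block_indices(center, num_frames, sts, total_frames):
    """Return indices of `num_frames` frames spaced `sts` frames apart.

    The block is centred on `center`. Indices that fall outside the video
    are clamped, so blocks near the ends repeat the first or last frame
    (an assumption made here for simplicity).
    """
    half = num_frames // 2
    indices = []
    for k in range(-half, num_frames - half):
        idx = center + k * sts
        idx = max(0, min(total_frames - 1, idx))  # clamp to the valid range
        indices.append(idx)
    return indices


# With sts = 3, a 5-frame block around frame 10 spans frames 4..16.
print(strided_block_indices(center=10, num_frames=5, sts=3, total_frames=100))
```

Under this reading, a larger sts covers a longer time span with the same number of frames, which is consistent with a smaller optimal value when the frame rate is lower.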
“…Formally, our network has to map each MRI image to a spectral vector. However, using several consecutive input frames instead of a single frame can significantly improve the results [6], [32]. Hence, the input for all our network configurations was a 3D array, treating time as the third axis besides the two spatial axes of the images.…”
Section: D-cnn+bilstm (mentioning)
confidence: 99%
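The idea of treating time as a third axis can be sketched as follows. This is only an illustration with plain nested lists; the helper name `stack_frames` is hypothetical, and the axis order (row, column, time) is an assumption, since the statement does not specify the layout.

```python
def stack_frames(frames):
    """Stack equally sized 2D frames into a 3D block indexed [row][col][time].

    `frames` is a list of 2D images (lists of rows) in temporal order.
    """
    height = len(frames[0])
    width = len(frames[0][0])
    return [
        [[frame[r][c] for frame in frames] for c in range(width)]
        for r in range(height)
    ]


# Two tiny 2x2 "images" taken at consecutive time steps.
frames = [
    [[1, 2], [3, 4]],
    [[5, 6], [7, 8]],
]
block = stack_frames(frames)
print(block[0][1])  # values of pixel (row 0, col 1) over time
```

Each pixel position then holds a short time series, which is what lets a 3D convolution combine spatial and temporal context.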
“…The input of our network is a 3D array of consecutive images, and the output is an 80-dimensional spectral vector. The convolutional (3D-CNN) network structure that we applied here [30] was the same as the lower, 'frame-level' part of the x-vector network, so we delay its presentation to the next section. As here the task is to estimate spectral vectors, we used a linear output layer and the network was trained to minimize the mean-squared error (MSE) of the regression task.…”
Section: The SSI Framework (mentioning)
confidence: 99%
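The regression setup described in this statement, a linear output layer trained with MSE, can be sketched in a few lines. This is not the cited network: the convolutional layers are omitted, the function names are hypothetical, and the two-dimensional vectors stand in for the 80-dimensional spectral vectors only to keep the example small.

```python
def linear_output(features, weights, bias):
    """Linear (identity-activation) output layer: one weighted sum per output unit."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]


def mse(pred, target):
    """Mean-squared error between a predicted and a target vector."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


features = [1.0, 2.0]             # stand-in for the last hidden layer's activations
weights = [[1.0, 0.0], [0.5, 0.5]]
bias = [0.0, 1.0]

pred = linear_output(features, weights, bias)
print(pred, mse(pred, [1.0, 0.5]))
```

A linear output is the usual choice for regression because the spectral targets are unbounded real values, so no squashing nonlinearity is applied before the loss.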
“…Here, we experiment with two neural network configurations in our SSI framework, that is, to estimate a speech mel-spectrogram from a sequence of ultrasound images. The first network is a 3D-CNN, following the proposal of Tóth et al. [24]. The second configuration combines the 3D-CNN layers with an additional BiLSTM layer, as it may be more effective in aggregating the information along the time axis.…”
Section: CNNs for the SSI and for the VAD Task (mentioning)
confidence: 99%