2022
DOI: 10.3390/s22031133
Beyond the Edge: Markerless Pose Estimation of Speech Articulators from Ultrasound and Camera Images Using DeepLabCut

Abstract: Automatic feature extraction from images of speech articulators is currently achieved by detecting edges. Here, we investigate the use of pose estimation deep neural nets with transfer learning to perform markerless estimation of speech articulator keypoints using only a few hundred hand-labelled images as training input. Midsagittal ultrasound images of the tongue, jaw, and hyoid and camera images of the lips were hand-labelled with keypoints, trained using DeepLabCut and evaluated on unseen speakers and syst…

Cited by 18 publications (11 citation statements)
References 33 publications
“…Still ultrasound images were aligned to the audio recording using pulses generated by the Articulate Instruments PStretch unit, recorded in AAA alongside the speech signal. Tongue splines were automatically fit with DeepLabCut (Mathis et al, 2018;Nath et al, 2019) using the MobileNet1.0-based neural network implemented in AAA (Wrench and Balch-Tomes, 2022). Tongue coordinates were rotated to a common horizontal plane (Lawson et al, 2019;Scobbie et al, 2011) by visually estimating the orientation of the ultrasound probe and camera from the side-view lip video data.…”
Section: Discussion (mentioning)
confidence: 99%
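The rotation to a common horizontal plane described in the excerpt above amounts to applying a 2-D rotation matrix to each tongue-spline coordinate. A minimal numpy sketch of that step — the function name and the 15-degree probe angle are illustrative assumptions, not values from the cited study:

```python
import numpy as np

def rotate_to_horizontal(points, probe_angle_deg):
    """Rotate (N, 2) occlusal-space coordinates so the visually
    estimated probe axis becomes the horizontal plane."""
    theta = np.deg2rad(-probe_angle_deg)  # undo the estimated probe tilt
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return points @ rot.T  # apply the rotation to every row at once

# Example: a unit vector tilted 15 degrees is brought back level.
spline = np.array([[0.0, 0.0],
                   [np.cos(np.deg2rad(15.0)), np.sin(np.deg2rad(15.0))]])
level = rotate_to_horizontal(spline, 15.0)
```

After the rotation, `level[1]` lies on the horizontal axis at (1, 0), so all speakers' splines can be compared in a shared coordinate frame.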
“…Human pose estimation technology such as Convolutional Pose Machines (CPM) and convolutional neural network (CNN) based methods, which allow extraction of human movement information directly from video clips, have been repeatedly tested by researchers [85, 86], while human pose estimation applications for analyzing movement in disease populations were reported to be useful by the studies in our review [14, 16–25, 27, 29, 32–38, 41, 44, 50–55, 57, 66, 71–74]. Given that such trajectory extraction methods are in rapid evolution and are becoming more mature for promising identification of posture [87–89], using a hand-held camera or smartphone as the MMC system would be especially beneficial for understanding the motor performance of individuals in their daily living tasks, hence providing valuable information on levels of impairment and on the constraints that patients might encounter in their activities of daily living in their real-life environment. It is understandable that individuals, particularly young children and older people, might behave differently when they are placed for motion capture in an unfamiliar laboratory or a simulated environment, thus risking the possibility that the motion analysis might not truly reflect the individuals’ actual movement patterns [90].…”
Section: Discussion (mentioning)
confidence: 99%
“…While DLC has been used extensively to track animal and human features during movements, its ability to track features in US videos has been minimally explored. [46] used DLC to track the upper surface of the tongue and compared it to other US contour estimators, concluding that DLC requires significantly less training data to perform with the same level of accuracy. [47] used DLC to track the gastrocnemius muscle–tendon junction, observing the morphology of the lower leg longitudinally.…”
Section: Methods (mentioning)
confidence: 99%
“…The manually labeled data of both Group 1 and Group 2 was restructured into the appropriate file types for DLC to use for training. The training data was augmented using the imgaug method ( https://github.com/aleju/imgaug ) and a 50-layer ResNet network was re-trained using this data for 500k iterations, where error commonly plateaus [46] for ResNet50.…”
Section: Methods (mentioning)
confidence: 99%
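The excerpt above augments training images with imgaug before retraining ResNet50. imgaug itself offers a large menu of chained transforms; as a rough toy illustration of what such augmentation does (this is not the study's actual pipeline, and the transform choices here are assumptions), a numpy version of two common transforms:

```python
import numpy as np

def augment(image, rng):
    """Toy augmentation: random horizontal flip plus mild additive
    Gaussian noise, two transforms of the kind imgaug provides.
    `image` is a 2-D grayscale array with values in [0, 1]."""
    out = image.copy()
    if rng.random() < 0.5:      # flip left-right half the time
        out = out[:, ::-1]
    out = out + rng.normal(0.0, 0.02, size=out.shape)  # sensor-like noise
    return np.clip(out, 0.0, 1.0)  # stay in valid intensity range

rng = np.random.default_rng(0)
frame = np.linspace(0.0, 1.0, 16).reshape(4, 4)      # stand-in "image"
batch = [augment(frame, rng) for _ in range(4)]      # 4 augmented copies
```

Each labelled frame yields several perturbed copies, which is how a few hundred hand-labelled images can supply enough variation to fine-tune a 50-layer network.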