UltraSuite: A Repository of Ultrasound and Acoustic Data from Child Speech Therapy Sessions

Eshky, Aciel; Ribeiro, Manuel Sam; Cleland, Joanne; Richmond, Korin; Roxburgh, Zoe; Scobbie, James M.; Wrench, Alan

doi:10.21437/interspeech.2018-1736

Cited by 41 publications

(34 citation statements)

References 28 publications


“…Uncorrelated segments: Speech therapy data contains interactions between the therapist and patient. The audio therefore contains speech from both speakers, while the ultrasound captures only the patient's tongue [16]. As a result, parts of the recordings will consist of completely uncorrelated audio and ultrasound.…”

Section: Lip Videos Vs Ultrasound Tongue Imaging (UTI)

Citation type: mentioning

confidence: 99%

“…This allows us to control how the model is trained and verify its performance using ground truth synchronisation offsets. We use UltraSuite: a repository of ultrasound and acoustic data gathered from child speech therapy sessions [16]. We used all three datasets from the repository: UXTD (recorded with typically developing children), and UXSSD and UPX (recorded with children with speech sound disorders).…”

Section: Data

Citation type: mentioning

confidence: 99%


Synchronising Audio and Ultrasound by Learning Cross-Modal Embeddings

Eshky, Ribeiro, Richmond et al. (2019). Interspeech 2019.

Self Cite


Audiovisual synchronisation is the task of determining the time offset between speech audio and a video recording of the articulators. In child speech therapy, audio and ultrasound videos of the tongue are captured using instruments which rely on hardware to synchronise the two modalities at recording time. Hardware synchronisation can fail in practice, and no mechanism exists to synchronise the signals post hoc. To address this problem, we employ a two-stream neural network which exploits the correlation between the two modalities to find the offset. We train our model on recordings from 69 speakers, and show that it correctly synchronises 82.9% of test utterances from unseen therapy sessions and unseen speakers, thus considerably reducing the number of utterances to be manually synchronised. An analysis of model performance on the test utterances shows that directed phone articulations are more difficult to automatically synchronise compared to utterances containing natural variation in speech such as words, sentences, or conversations.
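As a rough illustration of how an offset might be recovered once both streams have been embedded, the sketch below picks, from a set of candidate offsets, the one that minimises the mean distance between audio and ultrasound embeddings. The toy embedding vectors, the `mean_distance` and `best_offset` helpers, and the candidate range are all hypothetical; the abstract does not describe the selection procedure at this level of detail.

```python
import math


def mean_distance(audio, ultra, offset):
    """Mean Euclidean distance between ultrasound frame i and audio frame i + offset."""
    dists = []
    for i, u in enumerate(ultra):
        j = i + offset
        if 0 <= j < len(audio):
            a = audio[j]
            dists.append(math.sqrt(sum((x - y) ** 2 for x, y in zip(a, u))))
    # Offsets with no overlapping frames are treated as infinitely bad.
    return sum(dists) / len(dists) if dists else math.inf


def best_offset(audio, ultra, candidates):
    """Return the candidate offset (in frames) with the smallest mean distance."""
    return min(candidates, key=lambda k: mean_distance(audio, ultra, k))


# Toy embeddings (assumed): the audio stream is the ultrasound stream delayed by 2 frames.
ultra = [[0.0, 1.0], [1.0, 0.0], [2.0, 2.0], [3.0, 1.0]]
audio = [[9.0, 9.0], [9.0, 9.0]] + ultra

offset = best_offset(audio, ultra, range(-3, 4))
```

In a real system the embeddings would come from the trained two-stream network, and the offset in frames would be converted to time using the frame rate.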


“…Although ultrasound imaging is becoming less expensive to acquire, there is still a lack of large publicly available databases to evaluate automatic processing methods. The UltraSuite Repository [20], which we use in this work, helps alleviate this issue, but it still does not compare to standard speech recognition or image classification databases, which contain hundreds of hours of speech or millions of images.…”

Section: Ultrasound Tongue Imaging

Citation type: mentioning

confidence: 99%

“…We use the Ultrax Typically Developing dataset (UXTD) from the publicly available UltraSuite repository [20]. This dataset contains synchronized acoustic and ultrasound data from 58 typically developing children, aged 5-12 years old (31 female, 27 male).…”

Section: Ultrasound Data

Citation type: mentioning

confidence: 99%

“…This dataset contains synchronized acoustic and ultrasound data from 58 typically developing children, aged 5-12 years old (31 female, 27 male). The data was aligned at the phone-level, according to the methods described in [20,26]. For this work, we discarded the acoustic data and focused only on the B-Mode ultrasound images capturing a midsagittal view of the tongue.…”

Section: Ultrasound Data

Citation type: mentioning

confidence: 99%


Speaker-independent Classification of Phonetic Segments from Raw Ultrasound in Child Speech

Ribeiro, Eshky, Richmond et al. (2019). ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).

Self Cite


Ultrasound tongue imaging (UTI) provides a convenient way to visualize the vocal tract during speech production. UTI is increasingly being used for speech therapy, making it important to develop automatic methods to assist various time-consuming manual tasks currently performed by speech therapists. A key challenge is to generalize the automatic processing of ultrasound tongue images to previously unseen speakers. In this work, we investigate the classification of phonetic segments (tongue shapes) from raw ultrasound recordings under several training scenarios: speaker-dependent, multi-speaker, speaker-independent, and speaker-adapted. We observe that models underperform when applied to data from speakers not seen at training time. However, when provided with minimal additional speaker information, such as the mean ultrasound frame, the models generalize better to unseen speakers.
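To make the idea of "minimal additional speaker information" concrete, the sketch below computes a per-pixel mean frame for one speaker and pairs it with each of that speaker's frames as a second input channel. This is only one plausible way to supply the mean frame to a model: the abstract does not say how it is used, and the `mean_frame` and `add_speaker_channel` helpers and the tiny 2x2 frames are hypothetical.

```python
def mean_frame(frames):
    """Per-pixel mean over a list of equally sized 2-D frames (lists of rows)."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [
        [sum(f[r][c] for f in frames) / n for c in range(cols)]
        for r in range(rows)
    ]


def add_speaker_channel(frames):
    """Pair every frame with the speaker's mean frame (assumed second channel)."""
    mean = mean_frame(frames)
    return [(frame, mean) for frame in frames]


# Toy data (assumed): two 2x2 "ultrasound frames" from a single speaker.
frames = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
inputs = add_speaker_channel(frames)
```

Each element of `inputs` is a `(frame, mean_frame)` pair that a classifier could consume as a two-channel image.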


Continuous feature learning representation to XGBoost classifier on the aggregation of discriminative Features using DenseNet-121 architecture and ResNet 18 architectures towards Apraxia Recognition in the Child Speech Therapy

Ashwini, Bharathi (2024). Int J Speech Technol.
