A perceptual similarity space for speech based on self-supervised speech representations

Chernyak, Bronya R.; Bradlow, Ann R.; Keshet, Joseph; Goldrick, Matthew

doi:10.1121/10.0026358

The Journal of the Acoustical Society of America

2024

Abstract: Speech recognition by both humans and machines frequently fails in non-optimal yet common situations. For example, word recognition error rates for second-language (L2) speech can be high, especially under conditions involving background noise. At the same time, both human and machine speech recognition sometimes shows remarkable robustness against signal- and noise-related degradation. Which acoustic features of speech explain this substantial variation in intelligibility? Current approaches align speech to t…

Cited by 1 publication

References 47 publications

Dynamic acoustic vowel distances within and across dialects

Clopper

2024

The Journal of the Acoustical Society of America

Vowels vary in their acoustic similarity across regional dialects of American English, such that some vowels are more similar to one another in some dialects than others. Acoustic vowel distance measures typically evaluate vowel similarity at a discrete time point, resulting in distance estimates that may not fully capture vowel similarity in formant trajectory dynamics. In the current study, language and accent distance measures, which evaluate acoustic distances between talkers over time, were applied to the evaluation of vowel category similarity within talkers. These vowel category distances were then compared across dialects, and their utility in capturing predicted patterns of regional dialect variation in American English was examined. Dynamic time warping of mel-frequency cepstral coefficients was used to assess acoustic distance across the frequency spectrum and captured predicted Southern American English vowel similarity. Root-mean-square distance and generalized additive mixed models were used to assess acoustic distance for selected formant trajectories and captured predicted Southern, New England, and Northern American English vowel similarity. Generalized additive mixed models captured the most predicted variation, but, unlike the other measures, do not return a single acoustic distance value. All three measures are potentially useful for understanding variation in vowel category similarity across dialects.
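Two of the measures named in this abstract, dynamic time warping (DTW) over frame-wise feature vectors and root-mean-square (RMS) distance between trajectories, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Euclidean frame cost, and the toy inputs are assumptions made for illustration, and feature extraction (computing mel-frequency cepstral coefficients or formant tracks from audio) is omitted entirely.

```python
import math
from typing import Sequence

Frame = Sequence[float]  # one feature vector per time step (e.g. one MFCC frame)


def euclidean(a: Frame, b: Frame) -> float:
    """Euclidean distance between two frames (assumed local cost)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(s: Sequence[Frame], t: Sequence[Frame]) -> float:
    """Total cost of the cheapest monotonic alignment of s with t.

    Classic O(len(s) * len(t)) dynamic program with no window
    constraint and no length normalization (both are simplifications).
    """
    n, m = len(s), len(t)
    if n == 0 or m == 0:
        raise ValueError("both sequences must contain at least one frame")
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(s[i - 1], t[j - 1])
            # Extend the best of: step in s only, step in t only, step in both.
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def rms_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """RMS difference between two equal-length trajectories,
    e.g. one formant sampled at the same time points in two vowels."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("trajectories must be non-empty and equal in length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    # Toy one-dimensional "frames"; the second sequence repeats a frame,
    # which DTW absorbs by aligning it to the same frame twice.
    fast = [[0.0], [1.0], [2.0]]
    slow = [[0.0], [1.0], [1.0], [2.0]]
    print(dtw_distance(fast, slow))  # 0.0

    # Hypothetical formant values in Hz, invented for illustration.
    print(rms_distance([500.0, 520.0, 540.0], [510.0, 530.0, 550.0]))  # 10.0
```

DTW tolerates differences in duration because it searches over alignments, whereas the RMS measure assumes the two trajectories are already sampled at matching time points. Both return a single distance value, which the abstract contrasts with generalized additive mixed models.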
