Making More of Little Data: Improving Low-Resource Automatic Speech Recognition Using Data Augmentation

Bartelds, Martijn; San, Nay; McDonnell, Bradley; Jurafsky, Dan; Wieling, Martijn

doi:10.18653/v1/2023.acl-long.42

Cited by 7 publications

(2 citation statements)

References 23 publications


“…This matrix is referred to as the mask. The estimated mask aims to closely resemble the ideal ratio mask (IRM) [13], which is defined in Equation (3).…”

Section: Mask-based Separation Methods In Time Frequency Domains (mentioning)

confidence: 99%

“…It is not only hampered by acoustic interference by background noise such as traffic, crowd noise, but also speaker variability like accents, dialects, microphone quality and so on. Bartelds et al reveal issues such as the diminishing returns of data augmentation in data-rich environments and the oversight of sociolinguistic factors, which are critical in diverse linguistic contexts [3]. Furthermore, Li et al discuss the challenges faced by ASR systems in handling continuous speech sequences and streaming speech, highlighting the need for more sophisticated models to tackle these issues [4].…”

Section: Introduction (mentioning)

confidence: 99%


Target Speaker Extraction Using Attention-Enhanced Temporal Convolutional Network

Wang, Lai, Tai et al. 2024

Electronics


When recording conversations, there may be multiple people talking at once. While our human ears can filter out unwanted sounds, this can be challenging for automatic speech recognition (ASR) systems, leading to reduced accuracy. To address this issue, preprocessing mechanisms such as speech separation and targeted speaker extraction are necessary to separate each person’s speech. With the development of deep learning, the quality of separated speech has improved significantly. Our objective is to focus on speaker extraction, which entails implementing a primary system for speech extraction and a secondary subsystem for delivering target information. To accomplish this, we have chosen a temporal convolutional network (TCN) architecture as the foundation of our speech extraction model. A TCN enables convolutional neural networks (CNNs) to manage time series modeling, and it can be constructed in various model lengths. Furthermore, we have integrated attention enhancement into the secondary subsystem to provide the speech extraction model with comprehensive and effective target information, which helps to improve the model’s ability to estimate masks. As a result, the quality of the target speaker extraction will be greatly enhanced with a more precise mask.

