Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition

Zhang, Yu; Qin, James; Park, Daniel; Han, Wei; Chiu, Chung‐Cheng; Pang, Ruoming; Le, Quoc V.; Wu, Yonghui

doi:10.48550/arxiv.2010.10504

Cited by 84 publications

(136 citation statements)

References 47 publications

Supporting

Mentioning

130

Contrasting

Unclassified

“…By replacing the underlying long short-term memory (LSTM) [25] with Transformer [26] in the encoder, which allows a more powerful attention mechanism to be used, CTC thrives again in recent studies [27]. It gets further boosted by the emerged self-supervised learning technologies [28][29][30][31] which can learn a very good representation that carries semantic information.…”

Section: A) Connectionist Temporal Classification (mentioning)

confidence: 99%

“…SSL is even more powerful because it does not need any labeled data for pre-training, naturally solving the low-resource challenge. Therefore, SSL is becoming a new trend which especially works very well for ASR on resource limited languages [28][29][30][31][278][279][280][281], with representative technologies such as wav2vec 2.0 [28], autoregressive predictive coding [279], and HuBERT [31]. While most SSL studies focus on very limited supervised training data (e.g., 1000 hours), there are also recent studies showing promising results on industry-scale tens of thousand hours supervised training data [282,283].…”

Section: Miscellaneous Topics (mentioning)

confidence: 99%

Recent Advances in End-to-End Automatic Speech Recognition

Li

2021

Preprint

Recently, the speech community is seeing a significant trend of moving from deep neural network based hybrid modeling to end-to-end (E2E) modeling for automatic speech recognition (ASR). While E2E models achieve the state-of-the-art results in most benchmarks in terms of ASR accuracy, hybrid models are still used in a large proportion of commercial ASR systems at the current time. There are lots of practical factors that affect the production model deployment decision. Traditional hybrid models, being optimized for production for decades, are usually good at these factors. Without providing excellent solutions to all these factors, it is hard for E2E models to be widely commercialized. In this paper, we will overview the recent advances in E2E models, focusing on technologies addressing those challenges from the industry's perspective.

“…512-point FFT is used to extract 257-dimensional LPS. 6 pairs of microphones are selected for IPD and TPD computation, which are (0, 7), (1,6), (2,5), (3,4), (4, 7), (3,4). The total dimension of the input feature after concatenation is 257 × (1 + 6 + 1) = 2056.…”

Section: Separation Module (mentioning)

confidence: 99%
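
The dimension arithmetic in the statement above can be checked with a short sketch. It assumes the 512-point FFT is applied to real-valued frames, so that it yields 512 // 2 + 1 = 257 bins, and it reads the factor (1 + 6 + 1) as one LPS block, six per-pair blocks, and one further block. That grouping is an interpretation of the quoted text, not something the excerpt states, and all variable names are illustrative.

```python
import numpy as np

N_FFT = 512  # FFT size given in the quoted statement

# A real-input FFT of length N keeps N // 2 + 1 non-redundant bins.
frame = np.zeros(N_FFT)
n_bins = np.fft.rfft(frame).shape[0]

# Assumed grouping of the (1 + 6 + 1) factor; see the note above.
n_lps_blocks = 1     # log power spectrum
n_pair_blocks = 6    # one block per selected microphone pair
n_extra_blocks = 1   # remaining block in the quoted formula

total_dim = n_bins * (n_lps_blocks + n_pair_blocks + n_extra_blocks)
print(n_bins, total_dim)  # 257 2056
```

Both values agree with the quoted figures: 257 frequency bins per block and 2056 input dimensions after concatenation.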

“…With the development of speech techniques and deep neural networks, dramatic improvement has been achieved on multiple automatic speech recognition (ASR) benchmarks [1,2,3,4]. However, it remains a challenging task for multi-channel multi-speaker overlapped speech recognition due to the interfering speakers or background noise [5,6,7].…”

Section: Introduction (mentioning)

confidence: 99%

Multi-Channel Multi-Speaker ASR Using 3D Spatial Feature

Shao, Zhang, Yu

2021

Preprint

Automatic speech recognition (ASR) of multi-channel multi-speaker overlapped speech remains one of the most challenging tasks for the speech community. In this paper, we look into this challenge by utilizing the location information of target speakers in the 3D space for the first time. To explore the strength of the proposed 3D spatial feature, two paradigms are investigated: 1) a pipelined system with a multi-channel speech separation module followed by the state-of-the-art single-channel ASR module; 2) an "All-In-One" model where the 3D spatial feature is directly used as an input to the ASR system without explicit separation modules. Both of them are fully differentiable and can be back-propagated end-to-end. We test them on simulated overlapped speech and real recordings. Experimental results show that 1) the proposed All-In-One model achieved a comparable error rate to the pipelined system while reducing the inference time by half; 2) the proposed 3D spatial feature significantly outperformed (31% CERR) all previous works using the 1D directional information in both paradigms.

“…Recent developments in automatic speech recognition (ASR) for spoken languages [13,14,65,70] Text-based sign language video retrieval: In this work we introduce sign language video retrieval with free-form textual queries, the task of searching collections of sign language videos to find the best match for a free-form textual query, going beyond single keyword search.…”

Section: Introduction (mentioning)

confidence: 99%

Sign Language Video Retrieval with Free-Form Textual Queries

Duarte, Albanie, Giró-i-Nieto et al.

2022

Preprint

Systems that can efficiently search collections of sign language videos have been highlighted as a useful application of sign language technology. However, the problem of searching videos beyond individual keywords has received limited attention in the literature. To address this gap, in this work we introduce the task of sign language retrieval with free-form textual queries: given a written query (e.g., a sentence) and a large collection of sign language videos, the objective is to find the signing video in the collection that best matches the written query. We propose to tackle this task by learning cross-modal embeddings on the recently introduced large-scale How2Sign dataset of American Sign Language (ASL). We identify that a key bottleneck in the performance of the system is the quality of the sign video embedding, which suffers from a scarcity of labelled training data. We therefore propose SPOT-ALIGN, a framework for interleaving iterative rounds of sign spotting and feature alignment to expand the scope and scale of available training data. We validate the effectiveness of SPOT-ALIGN for learning a robust sign video embedding through improvements in both sign recognition and the proposed video retrieval task.
