Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
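For concreteness, the following is a minimal PyTorch sketch of the masked-prediction objective described above: frame-level features are clustered offline with k-means to produce pseudo-labels, a subset of frames is masked, and cross-entropy is computed only over the masked positions. The feature choice, module sizes, masking scheme, and all names here are illustrative assumptions, not the released fairseq implementation.

# Minimal sketch of a HuBERT-style masked-prediction step (illustrative, not the fairseq code).
import torch
import torch.nn as nn
import torch.nn.functional as F
from sklearn.cluster import KMeans

# 1) Offline clustering: assign each feature frame a pseudo-label.
#    The first iteration clusters simple acoustic features; later iterations re-cluster hidden states.
feats = torch.randn(1000, 39)                         # toy frame-level features (e.g. MFCC-like)
kmeans = KMeans(n_clusters=100, n_init=10).fit(feats.numpy())
targets = torch.as_tensor(kmeans.labels_, dtype=torch.long)   # (T,) cluster IDs as targets

# 2) Stand-in encoder (the real model uses a CNN front end followed by a Transformer).
encoder = nn.Sequential(nn.Linear(39, 256), nn.ReLU(), nn.Linear(256, 256))
proj = nn.Linear(256, 100)                            # project hidden states to cluster logits

# 3) Mask a random subset of frames (the paper masks spans) and predict labels at masked frames only.
mask = torch.rand(1000) < 0.08                        # toy masking rate
inputs = feats.clone()
inputs[mask] = 0.0                                    # a learned mask embedding is used in practice

logits = proj(encoder(inputs))                        # (T, 100)
loss = F.cross_entropy(logits[mask], targets[mask])   # prediction loss over masked regions only
loss.backward()

The "two iterations of clustering" in the abstract refers to re-running the clustering step on the model's learned hidden representations to obtain refined targets for a subsequent round of pre-training.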
Background: Music therapy, an innovative approach with proven effectiveness in many medical conditions, also appears beneficial in managing surgical patients. The aim of this study is to evaluate its effects, under general anesthesia, on perioperative patient satisfaction, stress, pain, and awareness. Methods: This is a prospective, randomized, double-blind study conducted in the visceral surgery operating theatre at Sahloul Teaching Hospital over a period of 4 months. Patients aged over 18 years undergoing scheduled surgery under general anesthesia were included. Patients undergoing urgent surgery or presenting hearing or cognitive disorders were excluded. Before induction, patients wore headphones linked to an MP3 player. They were randomly allocated into 2 groups: group M (with music during surgery) and group C (without music). Hemodynamic parameters, quality of awakening, pain experienced, patient satisfaction, and the incidence of awareness during anesthesia were recorded. Results: One hundred and forty patients were included and allocated into 2 groups that were comparable in demographic characteristics, type of surgical intervention, and anesthesia duration. Comparison of the two groups' hemodynamic profiles showed greater stability of systolic arterial blood pressure in group M. A calm recovery was more often noted in group M (77.1% versus 44%, p < 10⁻³). The average Visual Analog Scale (VAS) score was lower in the intervention group (33.8 ± 13.63 versus 45.1 ± 16.2; p < 10⁻³). The satisfaction rate was significantly higher in the experimental group (81.4% versus 51.4%; p < 10⁻³). The incidence of intraoperative awareness was higher in group C (8 cases versus 3 cases), but the difference was not statistically significant. Conclusion: Music therapy is a non-pharmacological, inexpensive, and non-invasive technique that can significantly enhance patient satisfaction and reduce distressing perioperative experiences related to stress, pain, and awareness.
Video recordings of speech contain correlated audio and visual information, providing a strong signal for speech representation learning from the speaker's lip movements and the produced sound. We introduce Audio-Visual Hidden Unit BERT (AV-HuBERT), a self-supervised representation learning framework for audio-visual speech, which masks multi-stream video input and predicts automatically discovered and iteratively refined multimodal hidden units. AV-HuBERT learns powerful audio-visual speech representation benefiting both lip-reading and automatic speech recognition. On the largest public lip-reading benchmark LRS3 (433 hours), AV-HuBERT achieves 32.5% WER with only 30 hours of labeled data, outperforming the former state-of-the-art approach (33.6%) trained with a thousand times more transcribed video data (31K hours) (Makino et al., 2019). The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training. Using our audio-visual representation on the same benchmark for audio-only speech recognition leads to a 40% relative WER reduction over the state-of-the-art performance (1.3% vs 2.3%). Our code and models are available at https://github.com/facebookresearch/av_hubert
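As a rough illustration of the multi-stream masked-prediction idea, the sketch below masks frame-aligned audio and lip-video features independently, fuses them by concatenation, and predicts multimodal cluster units only at masked frames. The shapes, the fusion scheme, and all names are assumptions for illustration and do not reproduce the released facebookresearch/av_hubert code.

# Minimal sketch of an AV-HuBERT-style multi-stream masked-prediction step (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

T, A_DIM, V_DIM, N_UNITS = 500, 26, 512, 100          # toy frame count, feature dims, unit vocabulary
audio = torch.randn(T, A_DIM)                         # e.g. filterbank frames
video = torch.randn(T, V_DIM)                         # e.g. lip-ROI visual features
targets = torch.randint(0, N_UNITS, (T,))             # multimodal cluster labels from a prior iteration

# Mask each stream independently so the model is pushed to infer one modality from the other.
mask_a = torch.rand(T) < 0.3
mask_v = torch.rand(T) < 0.3
audio_in, video_in = audio.clone(), video.clone()
audio_in[mask_a] = 0.0
video_in[mask_v] = 0.0

# Fuse the frame-aligned streams by concatenation, encode, and predict hidden units.
fusion = nn.Linear(A_DIM + V_DIM, 256)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(256, nhead=4, batch_first=True), num_layers=2)
proj = nn.Linear(256, N_UNITS)

hidden = encoder(fusion(torch.cat([audio_in, video_in], dim=-1)).unsqueeze(0)).squeeze(0)
masked = mask_a | mask_v
loss = F.cross_entropy(proj(hidden)[masked], targets[masked])   # loss on masked frames only
loss.backward()

Masking the two streams independently is what encourages the encoder to recover the units of one modality from the other, which is consistent with the representation benefiting both lip-reading and audio-only recognition.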