Channel-Combination Algorithms for Robust Distant Voice Activity and Overlapped Speech Detection

Mariotte, Théo; Larcher, Anthony; Montrésor, Silvio; Thomas, Jean-Hugh

doi:10.1109/taslp.2024.3369531

IEEE/ACM Trans. Audio Speech Lang. Process.

2024

Cited by 1 publication

References 55 publications

Comparison of wav2vec 2.0 models on three speech processing tasks

Kunešová, Zajíc, Šmídl et al. 2024

Int J Speech Technol

The current state-of-the-art for various speech processing problems is a sequence-to-sequence model based on a self-attention mechanism known as the transformer. The widely used wav2vec 2.0 is a self-supervised transformer model pre-trained on large amounts of unlabeled speech and then fine-tuned for a specific task. The data used for training and fine-tuning, along with the size of the transformer model, play a crucial role in both of these training steps. The most commonly used wav2vec 2.0 models are trained on relatively “clean” data from sources such as the LibriSpeech dataset, but we can expect there to be a benefit in using more realistic data gathered from a variety of acoustic conditions. However, it is not entirely clear how big the difference would be. Investigating this is the main goal of our article. To this end, we utilize wav2vec 2.0 models in three fundamental speech processing tasks: speaker change detection, voice activity detection, and overlapped speech detection, and test them on four real conversation datasets. We compare four wav2vec 2.0 models with different sizes and different data used for pre-training, and we fine-tune them either on in-domain data from the same dataset or on artificial training data created from the LibriSpeech corpus. Our results suggest that richer data that are more similar to the task domain bring better performance than a larger model.
