The fusion of two or more traits in multimodal biometrics generally improves recognition accuracy; the question is by how much. Large-scale databases are better suited to training deep learning models that generalize well and achieve high accuracy, so a large-scale multimodal database is beneficial. However, publicly available large-scale multimodal databases are scarce, especially for faces and voices. Moreover, because a face image is two-dimensional while a voice signal is one-dimensional, fusing the two effectively is challenging, and fusion has hitherto yielded only marginal improvements. This study proposes a semi-automated curation algorithm that extracts the faces and voices of target individuals from videos to create a large-scale face-voice database. The curation technique involves observing the temporal positions at which the target subject's face and voice occur in each video. These positions are supplied to a MATLAB R2017b script that detects the faces in the observed regions, then crops, resizes, auto-labels, and writes them to disk. A second MATLAB R2017b script extracts the audio content within the observed regions, auto-labels the voice segments, and writes them to disk. The resulting database, named NaijaFaceVoice, comprises 2,656 subjects with over 2 million faces and 195 hours of utterances. The database was employed to develop a large-scale recognition system based on convolutional neural networks. Robust fusion methods incorporating the proposed Spectrogram-Voting concept significantly improved performance, achieving a record equal error rate of 0.0003519%, an improvement by a factor of over 450.
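As a rough illustration of the face-extraction step, the MATLAB sketch below detects, crops, resizes, auto-labels, and saves faces within a manually observed time window. This is a minimal sketch, not the authors' script: the file names, subject identifier, time window, and output size are assumptions for illustration, and it relies on the Computer Vision Toolbox's Viola-Jones detector.

```matlab
% Minimal sketch (assumed names/paths): extract faces of a target subject
% from an observed time window of a video and auto-label them on disk.
videoFile = 'subject_0001.mp4';          % assumed input video
subjectID = 'subject_0001';              % assumed identity label
outDir    = fullfile('faces', subjectID);
if ~exist(outDir, 'dir'); mkdir(outDir); end

tStart = 12.0;  tEnd = 25.0;             % assumed observed window (seconds)

v        = VideoReader(videoFile);
detector = vision.CascadeObjectDetector();   % Viola-Jones face detector
v.CurrentTime = tStart;
k = 0;
while hasFrame(v) && v.CurrentTime <= tEnd
    frame  = readFrame(v);
    bboxes = step(detector, frame);          % detect faces in this frame
    for i = 1:size(bboxes, 1)
        face = imcrop(frame, bboxes(i, :));  % crop the detected face
        face = imresize(face, [224 224]);    % resize to a fixed size
        k = k + 1;
        % Auto-label: the file name encodes the subject identity
        imwrite(face, fullfile(outDir, sprintf('%s_%06d.jpg', subjectID, k)));
    end
end
```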
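The voice-extraction step could be sketched along the same lines: the audio samples falling within the observed window are read from the video container and written to an auto-labeled WAV file. The paths, label, and window are again assumed, and reading audio directly from a video file with audioread depends on platform codec support.

```matlab
% Minimal sketch (assumed names/paths): extract the voice segment within an
% observed time window and write it as an auto-labeled WAV file.
videoFile = 'subject_0001.mp4';
subjectID = 'subject_0001';
tStart = 12.0;  tEnd = 25.0;             % assumed observed window (seconds)

info  = audioinfo(videoFile);            % sample rate of the embedded audio
fs    = info.SampleRate;
range = [max(1, round(tStart*fs)), round(tEnd*fs)];
[x, fs] = audioread(videoFile, range);   % read only the observed segment

outDir = fullfile('voices', subjectID);
if ~exist(outDir, 'dir'); mkdir(outDir); end
audiowrite(fullfile(outDir, [subjectID '_seg001.wav']), x, fs);
```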