Following recent advances in deep learning and artificial intelligence, spoken language identification applications play an increasingly significant role in day-to-day life, especially in multi-lingual speech recognition. In this article, we propose a spoken language identification system that operates on sequences of feature vectors. The proposed system uses a hybrid Convolutional Recurrent Neural Network (CRNN), which combines a Convolutional Neural Network (CNN) with a Recurrent Neural Network (RNN), to identify seven spoken languages, including Arabic, drawn from subsets of the Mozilla Common Voice (MCV) corpus. The CRNN architecture exploits the complementary strengths of CNNs and RNNs. At the feature extraction stage, the system compares Gammatone Cepstral Coefficient (GTCC) features, Mel Frequency Cepstral Coefficient (MFCC) features, and a combination of both. The speech signals are segmented into frames and used as input to the CRNN. Experimental results indicate higher performance with combined GTCC and MFCC features than with either feature set alone: the proposed system reached an average accuracy of 92.81% in its best spoken language identification experiment. Furthermore, the system can learn language-specific patterns across various filter-size representations of the speech files.
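The feature-combination step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the frame count and coefficient counts (13 MFCCs and 13 GTCCs per frame) are assumed values, and random arrays stand in for the output of real feature extractors.

```python
import numpy as np

# Hypothetical dimensions: 13 MFCCs and 13 GTCCs per speech frame.
N_FRAMES, N_MFCC, N_GTCC = 100, 13, 13

# Stand-ins for real extractor output (e.g. from an MFCC/GTCC library);
# random values are used here purely to show the shapes involved.
rng = np.random.default_rng(0)
mfcc = rng.standard_normal((N_FRAMES, N_MFCC))  # (frames, MFCC coefficients)
gtcc = rng.standard_normal((N_FRAMES, N_GTCC))  # (frames, GTCC coefficients)

# Combined feature: concatenate along the coefficient axis, so each
# frame becomes a single 26-dimensional vector fed to the CRNN.
combined = np.concatenate([mfcc, gtcc], axis=1)
print(combined.shape)  # → (100, 26)
```

The per-frame concatenation is what lets the downstream CRNN see both filterbank representations at once, which matches the abstract's finding that the combined features outperform either set alone.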
Crowd density estimation is an important topic in computer vision owing to its widespread applications in surveillance, urban planning, and intelligence gathering. It is a challenging task because of factors such as the similarity of appearance between people, cluttered background components, and mutual occlusion in dense crowds. In this paper, we apply machine learning to crowd management in order to monitor populated areas and prevent congestion. We propose a Single Convolutional Neural Network with Three Layers (S-CNN3) model to count the number of people in a scene and thereby estimate crowd density. A comparative study of density counting then benchmarks the proposed model against a single convolutional neural network with four layers (S-CNN4) and Switched Convolutional Neural Networks (SCNN). The ShanghaiTech dataset, considered the largest database for crowd counting, is used in this work. The proposed model proves highly effective and efficient for crowd density estimation, with an average test accuracy of 99.88% and an average validation loss of 0.02, outperforming existing state-of-the-art models.
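A three-layer CNN's spatial behavior can be sketched with the standard convolution output-size formula. The layout below is purely illustrative (the abstract does not give the S-CNN3 filter sizes or input resolution): three 3×3 convolutions with padding 1, each followed by 2×2 max pooling, on an assumed 224×224 input.

```python
def conv2d_out(size, kernel, stride=1, pad=0):
    """Spatial output size of a conv or pool layer:
    floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical S-CNN3 layout (illustrative; not the paper's exact
# configuration): 3x3 conv with padding 1 preserves the size, then
# 2x2 max pooling with stride 2 halves it, repeated three times.
side = 224  # assumed square input crop
for _ in range(3):
    side = conv2d_out(side, 3, pad=1)     # conv: size unchanged
    side = conv2d_out(side, 2, stride=2)  # pool: size halved

print(side)  # → 28  (224 → 112 → 56 → 28)
```

Tracking the feature-map size this way shows why adding a fourth layer (as in S-CNN4) halves the spatial resolution again, trading detail for a larger receptive field.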
Spoken Language Identification (SLID) is an important step in speech-to-speech translation systems and multi-lingual automatic speech recognition. In recent research, deep learning mechanisms have been the prevailing approach to spoken language identification. This paper studies, detects, and analyzes spoken languages that resemble Arabic in the pronunciation of certain words, and proposes a deep learning-based architecture, specifically the Bidirectional Long Short-Term Memory (BLSTM) network, for spoken Arabic language identification and for discrimination between these similar languages, namely German, Spanish, French, and Russian, all taken from the Mozilla speech corpus. Our work additionally includes a linguistic study of the considered languages. A total of ten thousand speakers are chosen across all five languages, and the BLSTM architecture is designed and implemented using acoustic signal features and applied in five experiments. The results show precisions of 98.97%, 98.73%, 98.47%, and 99.75% for identifying spoken Arabic against German, Spanish, French, and Russian, respectively. We also achieve an average accuracy of 95.15% for discriminating among all five languages in terms of word pronunciation. Our findings confirm that a BLSTM architecture can distinguish between observably similar pronunciations of words across the considered languages.
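The bidirectional idea behind a BLSTM can be sketched in a few lines. This toy uses a plain tanh recurrence in place of full LSTM gating, and all dimensions (20 frames, 26 features, hidden size 8) are assumed values; the point is only to show how forward and backward passes over the acoustic frames are concatenated so each timestep sees both past and future context.

```python
import numpy as np

rng = np.random.default_rng(1)
T, D, H = 20, 26, 8  # frames, feature dim, hidden size (all hypothetical)
x = rng.standard_normal((T, D))          # one utterance's feature frames
Wf, Wb = rng.standard_normal((2, H, D)) * 0.1  # input weights, fwd/bwd
Uf, Ub = rng.standard_normal((2, H, H)) * 0.1  # recurrent weights, fwd/bwd

def run(seq, W, U):
    """Toy tanh recurrence standing in for a full LSTM cell."""
    h, outs = np.zeros(H), []
    for frame in seq:
        h = np.tanh(W @ frame + U @ h)
        outs.append(h)
    return np.stack(outs)

fwd = run(x, Wf, Uf)               # left-to-right pass
bwd = run(x[::-1], Wb, Ub)[::-1]   # right-to-left pass, realigned in time
blstm_out = np.concatenate([fwd, bwd], axis=1)  # (T, 2H) per-frame states
print(blstm_out.shape)  # → (20, 16)
```

In a real system each direction would be an LSTM cell (e.g. PyTorch's `nn.LSTM` with `bidirectional=True`), and the concatenated per-frame states would feed a language classifier; access to both left and right context is what helps the model separate similarly pronounced words across languages.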