Text classification is an important area of information retrieval: its techniques assign documents to a set of predefined categories. Many techniques have been proposed for this task, and a large body of research addresses English text classification, but comparatively little work addresses Arabic text classification. This paper applies three well-known classification techniques to an Arabic data set and presents a comparative study of them. The study uses a fixed number of documents per category in both the training and testing phases. The results show that the Support Vector Machine gives the best performance.
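A minimal sketch of the kind of classifier the abstract names (an SVM over bag-of-words features), assuming scikit-learn with TF-IDF weighting and hypothetical Arabic documents and labels; the paper's actual preprocessing and data set are not specified here.

    # Hypothetical SVM text-classification pipeline (not the paper's exact setup)
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    documents = ["نص تجريبي أول", "نص تجريبي ثان"]   # hypothetical training documents
    labels = ["sports", "economy"]                   # hypothetical category labels

    model = make_pipeline(
        TfidfVectorizer(),   # TF-IDF-weighted bag-of-words features
        LinearSVC(),         # linear support vector classifier
    )
    model.fit(documents, labels)
    print(model.predict(["نص جديد للتصنيف"]))        # predicted category for a new document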
Arabic automatic speech recognition (ASR) systems that handle diacritics can be integrated with other systems more readily than those that do not. In this work, state-of-the-art end-to-end deep learning approaches are investigated to build a robust diacritised Arabic ASR system. These approaches use Mel-Frequency Cepstral Coefficients and log Mel-scale filter bank energies as acoustic features. To the best of our knowledge, end-to-end deep learning has not previously been applied to diacritised Arabic automatic speech recognition. To fill this gap, this work presents a CTC-based ASR system, a CNN-LSTM system, and an attention-based end-to-end approach for improving diacritised Arabic ASR. In addition, a word-based language model is employed to achieve better results. The end-to-end approaches applied in this work are based on the state-of-the-art frameworks ESPnet and Espresso. Training and testing of these frameworks are performed on the Standard Arabic Single Speaker Corpus (SASSC), which contains 7 h of Modern Standard Arabic speech. Experimental results show that the CNN-LSTM with attention framework outperforms conventional ASR and the joint CTC-attention ASR framework on Arabic speech recognition, achieving a word error rate better than conventional ASR and joint CTC-attention ASR by 5.24% and 2.62%, respectively.
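A minimal sketch of the two acoustic feature types the abstract names (MFCCs and log Mel-scale filter bank energies), assuming librosa and a hypothetical 16 kHz WAV file; the actual systems rely on the ESPnet and Espresso front ends rather than this code.

    # Hypothetical MFCC and log Mel filter bank extraction for one utterance
    import librosa
    import numpy as np

    signal, sr = librosa.load("utterance.wav", sr=16000)   # hypothetical audio file

    # 13 Mel-frequency cepstral coefficients per frame
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)

    # 80 log Mel-scale filter bank energies per frame
    mel = librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=80)
    log_mel = np.log(mel + 1e-10)

    print(mfcc.shape, log_mel.shape)   # (13, n_frames), (80, n_frames)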
The Arabic language has a set of vocalisation marks called diacritics, which play an essential role in the meaning of words and their articulation; changing some diacritics changes the meaning of the sentence. However, the presence of these marks in the corpus transcription affects the accuracy of speech recognition. In this paper, we investigate the effect of diacritics on Arabic speech recognition based on end-to-end deep learning. The applied end-to-end approach combines CNN-LSTM and attention-based techniques as implemented in the state-of-the-art framework Espresso, built on PyTorch. In addition, to the best of our knowledge, the CNN-LSTM with attention approach has not previously been used for Arabic automatic speech recognition (ASR). To fill this gap, this paper proposes a new CNN-LSTM with attention approach for Arabic ASR. The language model in this approach is trained using an RNN-LM and an LSTM-LM on a non-diacritised transcription of the speech corpus. The Standard Arabic Single Speaker Corpus (SASSC), after omitting the diacritics, is used to train and test the deep learning model. Experimental results show that removing diacritics reduces the out-of-vocabulary rate and the perplexity of the language model. In addition, the word error rate (WER) improves significantly compared with the diacritised data, with an average WER reduction of 13.52%.
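A minimal sketch of the diacritics-removal step the abstract describes, assuming the standard Arabic short-vowel, tanween, shadda and sukun marks (U+064B..U+0652) plus the superscript alef (U+0670) are the marks to be stripped; the paper's exact normalisation rules are not given here.

    # Hypothetical removal of Arabic diacritical marks from a transcription
    import re

    DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

    def remove_diacritics(text: str) -> str:
        """Return the text with Arabic diacritical marks stripped."""
        return DIACRITICS.sub("", text)

    print(remove_diacritics("السَّلامُ عَلَيْكُم"))   # -> "السلام عليكم"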