We developed a novel classification of concept attributes and two supervised classifiers using this classification to identify concept attributes from candidate attributes extracted from the Web. Our binary (attribute / non-attribute) classifier achieves an accuracy of 81.82% whereas our 5-way classifier achieves 80.35%.
We propose a machine learning method for recognizing modern Arabic poems based on the common poetic features of modern Arabic poetry. The poetic features include rhyming, repetition, use of diacritics and punctuation, and text alignment. The method can classify text documents as poem or non-poem documents with a very high accuracy of 99.81%.
In this paper, we propose an Arabic word segmentation technique based on a bi-directional long short-term memory deep neural network. This paper addresses two tasks: word segmentation only, and word segmentation with a rewrite for nine rewrite cases. Word segmentation with a rewrite concerns inferring letters that are dropped or changed when the main word unit is attached to another unit, and writing these letters back when the two units are separated as a result of segmentation. We use only binary labels as indicators of segmentation positions: label 1 indicates the start of a new word (split) in a sequence of symbols not including whitespace, and label 0 indicates any other case (no-split). This differs from the mainstream feature representation for word segmentation, in which multi-valued labeling is used to mark the sequence symbols: beginning, inside, and outside. We used the Arabic Treebank data and its clitics segmentation scheme in our experiments. The trained model, without the help of any additional language resources such as dictionaries, morphological analyzers, or rules, achieved a high F1 value for Arabic word segmentation only (98.03%) and for Arabic word segmentation with the rewrite (more than 99% for frequent rewrite cases). We also compared our model with four state-of-the-art Arabic word segmenters. It performed better than the other segmenters on a modern standard Arabic text, and it was the best among the segmenters that do not use any additional language resources in another test using classical Arabic text.
Index Terms: Arabic word segmentation, bi-directional long short-term memory, deep learning, neural network, word embedding.
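The binary labeling described in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names `encode_segments` and `decode_segments` are hypothetical, the example word is a Buckwalter-style transliteration chosen only for illustration, and the sketch assumes that the first symbol of a sequence also receives label 1, which the abstract does not state explicitly.

```python
def encode_segments(segments):
    """Flatten a list of segments into symbols plus binary split labels.

    Label 1 marks the first symbol of a segment (split),
    label 0 marks every other symbol (no-split).
    Assumption: the very first symbol of the sequence is also labeled 1.
    """
    symbols, labels = [], []
    for segment in segments:
        for position, symbol in enumerate(segment):
            symbols.append(symbol)
            labels.append(1 if position == 0 else 0)
    return symbols, labels


def decode_segments(symbols, labels):
    """Rebuild the segments from symbols and their binary labels."""
    segments = []
    for symbol, label in zip(symbols, labels):
        if label == 1 or not segments:
            segments.append(symbol)   # start a new segment
        else:
            segments[-1] += symbol    # extend the current segment
    return segments


# Illustrative transliterated word split into three units.
example_symbols, example_labels = encode_segments(["w", "ktb", "hA"])
print(example_symbols)  # ['w', 'k', 't', 'b', 'h', 'A']
print(example_labels)   # [1, 1, 0, 0, 1, 0]
```

With only two labels, the model's per-symbol decision reduces to "split here or not", whereas a beginning/inside/outside scheme asks it to choose among three or more tags for the same positions.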
Arabic diacritics play a significant role in distinguishing words with the same orthography but different meanings, pronunciations, and syntactic functions. The presence of Arabic diacritics can be useful in many natural language processing applications, such as text-to-speech tasks, machine translation, and part-of-speech tagging. This paper discusses the use of bidirectional long short-term memory neural networks with conditional random fields for Arabic diacritization. This approach requires no morphological analyzers, dictionary, or feature engineering, but rather uses a sequence-to-sequence schema. The input is a sequence of characters that constitute the sentence, and the output consists of the corresponding diacritic(s) for each character in that sentence. The performance of the proposed approach was examined using four datasets with different sizes and genres, namely, the King Abdulaziz City for Science and Technology text-to-speech (KACST TTS) dataset, the Holy Quran, Sahih Al-Bukhary, and the Penn Arabic Treebank (ATB). For training, 60% of the sentences were randomly selected from each dataset, 20% were selected for validation, and 20% were selected for testing. The trained models achieved diacritic error rates of 3.41%, 1.34%, 1.57%, and 2.13% and word error rates of 14.46%, 4.92%, 5.65%, and 8.43% on the KACST TTS, Holy Quran, Sahih Al-Bukhary, and ATB datasets, respectively. Comparison of the proposed method with those used in other studies and existing systems revealed that its results are comparable to or better than those of the state-of-the-art methods.
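The two metrics reported in this abstract can be sketched as follows. The abstract does not define them, so this assumes the commonly used definitions: the diacritic error rate is the share of characters whose predicted diacritic differs from the reference, and the word error rate is the share of words containing at least one such character. The function name `error_rates` and the label strings are hypothetical, and details such as whether undiacritized characters or case endings are counted vary between studies.

```python
def error_rates(reference, predicted):
    """Return (diacritic_error_rate, word_error_rate).

    Both arguments are lists of words; each word is a list holding one
    diacritic label per character ("" meaning no diacritic).
    Assumption: every character counts toward the diacritic error rate.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and prediction differ in word count")

    total_chars = wrong_chars = wrong_words = 0
    for ref_word, pred_word in zip(reference, predicted):
        if len(ref_word) != len(pred_word):
            raise ValueError("a word differs in character count")
        errors = sum(r != p for r, p in zip(ref_word, pred_word))
        total_chars += len(ref_word)
        wrong_chars += errors
        wrong_words += 1 if errors else 0

    if total_chars == 0:
        raise ValueError("no characters to evaluate")
    return wrong_chars / total_chars, wrong_words / len(reference)


# Two words, five characters, one wrong diacritic in the first word.
reference = [["a", "", "u"], ["i", "o"]]
predicted = [["a", "", "a"], ["i", "o"]]
der, wer = error_rates(reference, predicted)
print(der, wer)  # 0.2 0.5
```

Because a single wrong diacritic makes the whole word count as an error, the word error rate is always at least as sensitive as the diacritic error rate, which is consistent with the reported word error rates being several times higher.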