Word clustering is a serious challenge in low-resource languages. Since words that share semantics are expected to be clustered together, it is common to use a feature vector representation generated from a distributional theory-based word embedding method. The goal of this work is to utilize Modern Standard Arabic (MSA) for better clustering performance of the low-resource Iraqi vocabulary. We began with a new Dialect Fast Stemming Algorithm (DFSA) that utilizes the MSA data. The proposed algorithm achieved 0.85 accuracy measured by the F1 score. Then, the distributional theory-based word embedding method and a new simple, yet effective, feature vector named Wasf-Vec word embedding are tested. Wasf-Vec word representation utilizes a word’s topology features. The difference between Wasf-Vec and distributional theory-based word embedding is that Wasf-Vec captures relations that are not contextually based. The embedding is followed by an analysis of how the dialect words are clustered within other MSA words. The analysis is based on the word semantic relations that are well supported by solid linguistic theories to shed light on the strong and weak word relation representations identified by each embedding method. The analysis is handled by visualizing the feature vector in two-dimensional (2D) space. The feature vectors of the distributional theory-based word embedding method are plotted in 2D space using the t-sne algorithm, while the Wasf-Vec feature vectors are plotted directly in 2D space. A word’s nearest neighbors and the distance-histograms of the plotted words are examined. For validation purpose of the word classification used in this article, the produced classes are employed in Class-based Language Modeling (CBLM). Wasf-Vec CBLM achieved a 7% lower perplexity (pp) than the distributional theory-based word embedding method CBLM. This result is significant when working with low-resource languages.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.