Spelling correctness is a major concern for researchers writing scientific texts such as articles and theses, where spelling errors are unacceptable. Like many natural language processing problems, spelling correction depends heavily on the structure and grammar of the language. Persian is a challenging language in this area because of its homophonic and dotted letters. In addition, many Arabic words and terms have entered Persian, bringing the challenges of Arabic spelling correction with them and creating a complex combination. Moreover, because many Persian speakers are Muslim, Arabic content from the holy Qur'an has found its way into Persian texts, so that today there are many Islamic texts with mixed Persian and Arabic content and a great need for a tool that can correct spelling errors in both languages. This work proposes an unsupervised machine learning approach built on N-gram language models. The data consists of about 220,000 sentences with mixed Arabic and Persian content, from which the N-grams are extracted. The language model uses probabilities estimated from N-gram frequencies to score the candidate corrections for an erroneous word and select the best one. To evaluate the proposed method, test data was prepared from Persian-Islamic content, with spelling errors generated manually. The evaluation results show a significant improvement over similar tools for the Persian language.
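The candidate-scoring idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the use of bigrams, add-one smoothing, candidate generation by a single character edit, and all function names are assumptions made for illustration, and a tiny English toy corpus stands in for the mixed Persian-Arabic data.

```python
from collections import Counter
import math

def train_bigram_model(sentences):
    """Count unigrams and bigrams over whitespace-tokenized sentences."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_logprob(prev, word, unigrams, bigrams):
    """Add-one smoothed log P(word | prev); the smoothing choice is an assumption."""
    vocab_size = len(unigrams)
    return math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))

def edits1(word, alphabet):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    transposes = {a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1}
    replaces = {a + c + b[1:] for a, b in splits if b for c in alphabet}
    inserts = {a + c + b for a, b in splits for c in alphabet}
    return deletes | transposes | replaces | inserts

def correct(tokens, index, unigrams, bigrams, alphabet):
    """Return the best-scoring in-vocabulary candidate for tokens[index]."""
    word = tokens[index]
    if word in unigrams:          # known word: treated as correctly spelled
        return word
    candidates = sorted(c for c in edits1(word, alphabet) if c in unigrams)
    if not candidates:            # nothing plausible found: leave unchanged
        return word
    prev = tokens[index - 1] if index > 0 else "<s>"
    nxt = tokens[index + 1] if index + 1 < len(tokens) else "</s>"

    def score(candidate):
        # Score the candidate by how well it fits its left and right neighbors.
        return (bigram_logprob(prev, candidate, unigrams, bigrams)
                + bigram_logprob(candidate, nxt, unigrams, bigrams))

    return max(candidates, key=score)

# Toy stand-in corpus (hypothetical data, not the paper's 220,000 sentences).
toy_corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "a bat flew at night",
    "the cat sat down",
]
alphabet = "abcdefghijklmnopqrstuvwxyz"
unigrams, bigrams = train_bigram_model(toy_corpus)

print(correct("the xat sat".split(), 1, unigrams, bigrams, alphabet))  # cat
print(correct("a xat flew".split(), 1, unigrams, bigrams, alphabet))   # bat
```

The same misspelling, "xat", resolves to different words depending on its neighbors, which is the reason for scoring candidates with a language model rather than by edit distance alone. For Persian and Arabic text the `alphabet` argument would be replaced with the relevant script's letters.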