We examined how letter position coding is achieved in a script (Arabic) in which the different letter forms (i.e., allographs) may vary depending on their position within the letter string (e.g., compare the same-ligation pair [see text] and [see text] vs. the different-ligation pair [see text] and [see text]). To that end, we conducted an experiment in Uyghur, an agglutinative language from the Turkic family that employs an Arabic-based script in which both consonants and vowels are explicitly written. Participants had to reproduce the correct word forms in rapid serial visual presentation sentences that either contained jumbled words (with the same ligation or different ligation) or were intact. The results revealed that readers had more difficulty correctly reporting the target words in the jumbled sentences when the letter transposition involved changes in the ligation pattern, thus demonstrating that position-dependent allography affects letter position coding. This finding places constraints on a universal model of letter position encoding.
One basic feature of the Arabic script is its semicursive style: some letters are connected to the next, but others are not, as in the Uyghur word [see text]/ya xʃi/ ("good"). None of the current orthographic coding schemes in models of visual-word recognition, which were created for the Roman script, assign a differential role to the coding of within letter "chunks" and between letter "chunks" in words in the Arabic script. To examine how letter identity/position is coded at the earliest stages of word processing in the Arabic script, we conducted 2 masked priming lexical decision experiments in Uyghur, an agglutinative Turkic language. The target word was preceded by an identical prime, by a transposed-letter nonword prime (that either kept the ligation pattern or did not), or by a 2-letter replacement nonword prime. Transposed-letter primes were as effective as identity primes when the letter transposition in the prime kept the same ligation pattern as the target word (e.g., [see text]/inta_jin/-/itna_jin/), but not when the transposed-letter prime didn't keep the ligation pattern (e.g., [see text]/so_w_ʁa_t/-/so_ʁw_a_t/). Furthermore, replacement-letter primes were more effective when they kept the ligation pattern of the target word than when they did not (e.g., [see text]/so_d_ʧa_t/-/so_w_ʁa_t/ faster than [see text]/so_ʧd_a_t/-/so_w_ʁa_t/). We examined how input coding schemes could be extended to deal with the intricacies of semicursive scripts.
To improve the utilization of text storage resources and the efficiency of data transmission, we proposed two syllable-based Uyghur text compression coding schemes. First, according to the statistics of syllable coverage of the corpus text, we constructed 12-bit and 16-bit syllable code tables and added commonly used symbols (such as punctuation marks and ASCII characters) to the code tables. To enable the coding schemes to process Uyghur texts mixed with symbols from other languages, we introduced a flag code in the compression process to mark the Unicode encodings that were not in the code table. The experiments showed that the 12-bit coding scheme had an average compression ratio of 0.3 on Uyghur text less than 4 KB in size and that the 16-bit coding scheme had an average compression ratio of 0.5 on text less than 2 KB in size. Our compression schemes outperformed GZip, BZip2, and the LZW algorithm on short text and could be effectively applied to the compression of Uyghur short text for storage and applications.
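The idea of a fixed-width syllable code table with a flag code for out-of-table characters can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the authors' scheme: the flag value `0xFFF`, the function names, the choice to spell an escaped character as two 12-bit units, and the toy romanized syllables. The text is assumed to be already segmented into syllables.

```python
FLAG = 0xFFF  # hypothetical escape code: "next two units hold a raw code point"

def build_table(symbols):
    """Map each syllable or symbol to a 12-bit code (FLAG is reserved)."""
    if len(symbols) >= FLAG:
        raise ValueError("too many symbols for a 12-bit table")
    return {sym: code for code, sym in enumerate(symbols)}

def encode(tokens, table):
    """Turn pre-segmented tokens into a list of 12-bit codes."""
    codes = []
    for tok in tokens:
        if tok in table:
            codes.append(table[tok])
        else:
            # Not in the table: escape each character as FLAG + 2 units.
            for ch in tok:
                cp = ord(ch)
                codes += [FLAG, cp >> 12, cp & 0xFFF]
    return codes

def decode(codes, table):
    """Invert encode(); returns the reconstructed string."""
    inverse = {code: sym for sym, code in table.items()}
    out, i = [], 0
    while i < len(codes):
        if codes[i] == FLAG:
            out.append(chr((codes[i + 1] << 12) | codes[i + 2]))
            i += 3
        else:
            out.append(inverse[codes[i]])
            i += 1
    return "".join(out)

def pack(codes):
    """Pack 12-bit codes into bytes, zero-padding the final byte."""
    bits = 0
    for c in codes:
        bits = (bits << 12) | c
    nbytes = (12 * len(codes) + 7) // 8
    pad = nbytes * 8 - 12 * len(codes)
    return (bits << pad).to_bytes(nbytes, "big")

def unpack(data, count):
    """Recover `count` 12-bit codes from bytes produced by pack()."""
    pad = len(data) * 8 - 12 * count
    bits = int.from_bytes(data, "big") >> pad
    return [(bits >> (12 * (count - 1 - i))) & 0xFFF for i in range(count)]

# Toy table and input; "Q" is deliberately missing from the table.
table = build_table(["yax", "shi", " ", "."])
tokens = ["yax", "shi", " ", "Q", "."]
codes = encode(tokens, table)
packed = pack(codes)
restored = decode(unpack(packed, len(codes)), table)
```

In this sketch an in-table syllable costs 12 bits regardless of how many characters it spans, while an escaped character costs 36 bits, so the benefit depends on how much of the text the table covers.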
Pattern matching is widely used in fields such as information retrieval, natural language processing (NLP), data mining and network security. Research on pattern matching is also ongoing for Uyghur, a typical agglutinative, low-resource language with complex morphology, spoken by the ethnic Uyghur group in Xinjiang, China. Because of these language characteristics, pattern matching that uses characters or words as basic units performs poorly. Two problems stand out: (1) vowel weakening and (2) morphological changes caused by suffixes. To address them, this paper proposes a Boyer–Moore-U (BM-U) algorithm and a retrievable syllable coding format, both based on the syllable features of the Uyghur language and on an improvement of the Boyer–Moore (BM) algorithm. The algorithm uses syllable features to perform pattern matching, which effectively solves the problem of weakened vowels and allows better matching of words whose stem shape changes. Finally, in pattern matching experiments on vowel-weakened words, the precision, recall, F1-measure and accuracy of the BM-U algorithm improved by 4%, 55%, 33% and 25% on character-encoded text and by 10%, 52%, 38% and 38% on syllable-encoded text, compared with the BM algorithm.
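Matching on syllables rather than characters can be sketched with the bad-character shift of Boyer–Moore in its Horspool simplification, applied to a sequence of syllable tokens. This is not the BM-U algorithm, whose details are not given in the abstract, and it does not handle vowel weakening; the function name and the toy romanized syllables are assumptions for illustration only.

```python
def horspool_search(text, pattern):
    """Return the start indices where `pattern` occurs in `text`.

    Both arguments are lists of syllable tokens, so a comparison
    succeeds only on whole syllables, never on part of one.
    """
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Bad-character table: for every syllable in the pattern except the
    # last one, the distance from its last occurrence to the pattern end.
    shift = {syl: m - 1 - i for i, syl in enumerate(pattern[:-1])}
    hits = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            hits.append(pos)
        # Shift by the table entry of the syllable aligned with the
        # pattern's last position; unseen syllables skip the full length.
        pos += shift.get(text[pos + m - 1], m)
    return hits

# Toy, pre-segmented input (hypothetical romanized syllables).
text = "bu yax shi ki tab yax shi".split()
pattern = ["yax", "shi"]
matches = horspool_search(text, pattern)
```

Because the alphabet here is the set of syllables rather than the set of letters, syllables absent from the pattern are common, and each one lets the search skip a full pattern length.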