We have studied the segmentation of two-letter AB heterosequences composed of subsequences that differ in composition and in the distribution of A and B monomer units along the chain. Our approach is based on the segmentation function S(k), introduced in the present work, and on the Jensen–Shannon divergence measure determined with respect to the probabilities of the lengths of uniform blocks of A and B monomer units. The function S(k) is shown to be extremely sensitive to the sequence statistics; even visual analysis of S(k) allows some features of these statistics to be judged. In particular, S(k) is constant for random copolymers, oscillates for random block copolymers, and grows monotonically up to some constant value for proteinlike copolymers. However, because of the significant fluctuations observed for short sequences, S(k) can be used effectively only to segment a heterosequence composed of very long subsequences. The Jensen–Shannon divergence measure, on the other hand, does not allow one to judge the type of statistics but is extremely efficient for segmenting a heterosequence. The two functions are therefore mutually complementary and together provide an effective approach for recognizing and segmenting heterosequences. As an example, the methods developed are applied to concatenated sequences of different proteins.

Segmentation function S(k, l, x) as a function of parameter k and starting number x of the “window” for a sequence composed of elastin and ribonuclease sequences.
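The idea of segmenting an AB sequence by comparing block-length statistics on either side of a candidate boundary can be illustrated with a minimal sketch. It is not the authors' implementation: the equal-weight form of the Jensen–Shannon divergence, the base-2 logarithm, the exhaustive scan over split points, and the names `block_lengths`, `length_distribution`, `js_divergence`, `best_split`, and `margin` are all assumptions made for illustration. The function S(k) is not sketched because the abstract does not define it.

```python
from collections import Counter
from math import log2


def block_lengths(seq):
    """Lengths of maximal runs of identical letters, e.g. 'AABBBA' -> [2, 3, 1]."""
    lengths = []
    prev, run = None, 0
    for ch in seq:
        if ch == prev:
            run += 1
        else:
            if prev is not None:
                lengths.append(run)
            prev, run = ch, 1
    if prev is not None:
        lengths.append(run)
    return lengths


def length_distribution(seq):
    """Empirical probability of each block length (A and B blocks pooled; an assumption)."""
    counts = Counter(block_lengths(seq))
    total = sum(counts.values())
    return {length: c / total for length, c in counts.items()}


def entropy(p):
    """Shannon entropy in bits of a distribution given as {outcome: probability}."""
    return -sum(v * log2(v) for v in p.values() if v > 0)


def js_divergence(p, q):
    """Equal-weight Jensen-Shannon divergence: H((p+q)/2) - (H(p) + H(q)) / 2."""
    keys = set(p) | set(q)
    mix = {k: 0.5 * p.get(k, 0.0) + 0.5 * q.get(k, 0.0) for k in keys}
    return entropy(mix) - 0.5 * entropy(p) - 0.5 * entropy(q)


def best_split(seq, margin=20):
    """Scan candidate boundaries and return (position, divergence) of the maximum.

    `margin` (hypothetical parameter) keeps both parts long enough to have
    a meaningful block-length distribution.
    """
    best_pos, best_val = None, -1.0
    for pos in range(margin, len(seq) - margin + 1):
        d = js_divergence(length_distribution(seq[:pos]),
                          length_distribution(seq[pos:]))
        if d > best_val:
            best_pos, best_val = pos, d
    return best_pos, best_val


if __name__ == "__main__":
    # Strictly alternating part followed by a part built from blocks of five.
    sequence = "AB" * 50 + "AAAAABBBBB" * 10
    print(best_split(sequence))
```

The scan recomputes both distributions at every candidate position, which is quadratic in the sequence length; that is acceptable for a sketch but would need incremental updates for long sequences.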
Summary: We have analyzed protein sequences, treating them as texts written in a “protein” language. We have shown that repeating patterns (words) of various lengths can be identified in these sequences. The maximum word lengths were found to differ for proteins belonging to different classes; the corresponding values can therefore be used to characterize the protein type. The suggested technique was first applied to analyze (decompose into words) ordinary (literary) texts written as a gapless symbolic sequence without spaces or punctuation marks. Tests using fiction, scientific, and popular scientific English texts demonstrated the relative efficiency of the technique.

Maximum word length for various proteins: fibrillar proteins, globular proteins, and membrane proteins.
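As a rough illustration of what a "maximum word length" could mean for a gapless symbolic sequence, the sketch below finds the longest substring that occurs at least twice. This is an assumed stand-in, not the technique of the paper: the summary does not say how words are identified, occurrences are allowed to overlap here, and the function names `has_repeat` and `longest_repeated_word` are hypothetical.

```python
def has_repeat(text, length):
    """Return a substring of the given length that occurs at least twice, or None."""
    seen = set()
    for i in range(len(text) - length + 1):
        piece = text[i:i + length]
        if piece in seen:
            return piece
        seen.add(piece)
    return None


def longest_repeated_word(text):
    """Longest substring occurring at least twice (occurrences may overlap).

    Binary search over the length is valid because a repeat of length L
    implies a repeat of every shorter length.
    """
    lo, hi = 0, len(text) - 1
    best = ""
    while lo < hi:
        mid = (lo + hi + 1) // 2
        found = has_repeat(text, mid)
        if found:
            best, lo = found, mid
        else:
            hi = mid - 1
    return best


if __name__ == "__main__":
    # English text written without spaces or punctuation, as in the summary.
    print(longest_repeated_word("thecatsatonthematthecatran"))
```

The same function can be applied to a protein sequence written in one-letter amino acid codes; the length of the returned string then plays the role of a maximum word length under the assumptions stated above.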