Ankyrin-containing proteins are one of the most abundant repeat protein families present in all extant organisms. They are made of tandem copies of similar amino acid stretches that fold into elongated architectures. Here, we built and curated a dataset of 200 thousand proteins that contain 1.2 million Ankyrin regions, and we characterize the abundance, structure and energetics of the repetitive regions in natural proteins. We found a continuous, roughly exponential distribution of array lengths, with an exceptional frequency at 24 repeats. Individual repeats are seldom interrupted by long insertions and accept few deletions, consistent with the known tertiary structures. We found that longer arrays are made up of repeats that are more similar to each other than those of shorter arrays, and that they display more favourable folding energy, hinting at their evolutionary origin. The array distributions show that there is a physical upper limit to the size of an array of Ankyrin repeats of about 120 copies, consistent with the limit found in nature. Analysis of the identity patterns within the arrays suggests that they may have originated by sequential copies of more than one Ankyrin unit.
Author summary
Repeat proteins are coded in tandem copies of similar amino acid stretches. We built and curated a large dataset of Ankyrin-containing proteins, one of the most abundant families of repeat proteins, and characterized the structure of the arrays formed by the repetitions. We found that large arrays are constructed with repetitions that are more similar to each other than those in shorter arrays. Also, the larger the array, the more favourable its folding energy is. We speculate about the mechanistic origin of large arrays and hint at their evolutionary dynamics.

Natural proteins that are formed with repetitions of stretches of amino acids are abundant in extant organisms [1]. Some proteins contain repetitions of short stretches, forming fibrillar structures like collagen, and some contain longer repetitions of globular domains like beads on a string. In between, there is a class of proteins that is formed with tandem repetitions of similar stretches of about 30 to 40 residues. Proteins of this kind (from now on, repeat proteins) are present in all organisms and are believed to be ancient systems [2]. Typically these polypeptides form elongated structures where each repeat motif packs against its nearest neighbors, stabilizing an overall super-helical fold [3]. Since most of the structural characterization of these proteins was performed on model systems of short arrays that are experimentally amenable, we aim to characterize the overall structures of an abundant family of proteins.

November 25, 2019
Ankyrin repeat proteins (ANKs) are usually described as linear arrays of tandem copies of a 33-residue motif that folds into an α-loop-α-β-hairpin/loop structure. Being one of the most common repeat proteins in nature, these molecules are believed to function as sp...