Extant protein-coding sequences span a huge range of ages, from those that emerged only recently to those present in the last universal common ancestor. Because evolution has had less time to act on young sequences, there might be ‘phylostratigraphy’ trends in any properties that evolve slowly with age. A long-term reduction in hydrophobicity and hydrophobic clustering was found in previous, taxonomically restricted studies. Here we perform integrated phylostratigraphy across 435 fully sequenced species, using sensitive HMM methods to detect protein domain homology. We find that the reduction in hydrophobic clustering is universal across lineages. However, a tendency toward higher structural disorder is found only among young animal domains. Among ancient domains, trends in amino acid composition reflect the order of recruitment into the genetic code, suggesting that the composition of the contemporary descendants of ancient sequences reflects amino acid availability during the earliest stages of life, when these sequences first emerged.
The same nucleotide sequence can encode two protein products in different reading frames. Overlapping gene regions encode higher levels of intrinsic structural disorder (ISD) than nonoverlapping genes (39% vs. 25% in our viral dataset). This might be because of the intrinsic properties of the genetic code, because one member per pair was recently born in a process that favors high ISD, or because high ISD relieves increased evolutionary constraint imposed by dual-coding. Here, we quantify the relative contributions of these three alternative hypotheses. We estimate that the recency of gene birth explains [Formula: see text] or more of the elevation in ISD in overlapping regions of viral genes. While the two reading frames within a same-strand overlapping gene pair have markedly different ISD tendencies that must be controlled for, their effects cancel out to make no net contribution to ISD. The remaining elevation of ISD in the older members of overlapping gene pairs, presumed due to the need to alleviate evolutionary constraint, was already present prior to the origin of the overlap. Same-strand overlapping gene birth events can occur in two different frames, favoring high ISD either in the ancestral gene or in the novel gene; surprisingly, most gene birth events contained completely within the body of an ancestral gene favor high ISD in the ancestral gene (23 phylogenetically independent events vs. 1). This can be explained by mutation bias favoring the frame with more start codons and fewer stop codons.
The same nucleotide sequence can encode two protein products in different reading frames. Overlapping gene regions are known to encode higher levels of intrinsic structural disorder (ISD) than non-overlapping genes (39% vs. 25% in our viral dataset). Two explanations for elevated ISD have been proposed: that high ISD relieves the increased evolutionary constraint imposed by dual-coding, and that one member per pair was recently born de novo in a process that favors high ISD. Here we quantify the relative contributions of these two alternative hypotheses, as well as a third hypothesis that has not previously been explored: that high ISD might be an artifact of the genetic code. We find that the recency of de novo gene birth explains ∼32% of the elevation in ISD in overlapping regions of viral genes, with the rest attributed, by a process of elimination, to relieving constraint. While the two reading frames within a same-strand overlapping gene pair have markedly different ISD tendencies, their effects cancel out such that the properties of the genetic code do not contribute overall to elevated ISD. Same-strand overlapping gene birth events can occur in two different frames, favoring high ISD either in the ancestral gene or in the novel gene; surprisingly, most de novo gene birth events contained completely within the body of an ancestral gene favor high ISD in the ancestral gene (23 phylogenetically independent events vs. 1). This can be explained by mutation bias favoring the frame with more start codons and fewer stop codons.
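The opening claim, that one nucleotide sequence can encode different proteins depending on reading frame, can be illustrated with a minimal Python sketch. It is not taken from the study: the `translate` helper and the 15-nucleotide example sequence are hypothetical, and the standard genetic code is assumed (viral genomes are not guaranteed to use it in every case).

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# Assumption: the standard code applies; '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna, frame=0):
    """Translate the same strand of `dna` starting at offset `frame` (0, 1 or 2).

    Trailing nucleotides that do not fill a complete codon are ignored.
    """
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[c] for c in codons)


# Hypothetical toy sequence, chosen only for illustration.
seq = "ATGGCTAAAGGTTGA"
for frame in range(3):
    print(frame, translate(seq, frame))
```

Shifting the frame by one nucleotide changes every codon, so the two products of a same-strand overlap share no amino acids by position even though they share every nucleotide. This is why the amino acid composition, and hence the ISD tendency, of one frame constrains the other.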
The effectiveness of selection varies among species. It is often estimated by means of an 'effective population size' based on neutral polymorphism, but this is confounded in complex ways with demography. The strength of codon bias more directly pertains to how well adaptation at many sites can be maintained in the face of deleterious mutations, but past metrics that compare codon bias across species are confounded by among-species variation in %GC content and/or amino acid composition. Here we propose a new Codon Adaptation Index of Species (CAIS) that corrects for both confounders. Unlike previous metrics, CAIS yields the expected relationship with adult vertebrate body mass. As an example of the use of CAIS, we ask whether protein domains evolve lower intrinsic structural disorder (ISD) when present in more exquisitely adapted species, as expected given that ISD is higher in eukaryotic proteomes than prokaryotic proteomes. Using phylogenetically corrected linear models, we find, contrary to expectations, that the ISD of a given protein domain evolves to be higher when in well-adapted species. This effect is stronger in young protein domains but is also present in ancient domains.
Extant protein-coding sequences span a huge range of ages, from those that emerged only recently in particular lineages, to those present in the last universal common ancestor. Because evolution has had less time to act on young sequences, there might be "phylostratigraphy" trends in any properties that evolve slowly with age. Indeed, a long-term reduction in hydrophobicity and in hydrophobic clustering has been found in previous, taxonomically restricted studies. Here we perform integrated phylostratigraphy across 435 fully sequenced and dated eukaryotic species, using sensitive HMM methods to detect homology of protein domains (which may vary in age within the same gene), and applying a variety of quality filters. We find that the reduction in hydrophobic clustering is universal across diverse lineages, showing limited sign of saturation. But the tendency for young domains to have higher protein structural disorder, driven primarily by more hydrophilic amino acids, is found only among young animal domains, and not young plant domains, nor ancient domains predating the existence of the last eukaryotic common ancestor. Among ancient domains, trends in amino acid composition reflect the order of recruitment into the genetic code, suggesting that events during the earliest stages of life on earth continue to have an impact on the composition of ancient sequences.