Over the past decade, evidence has accumulated that new protein-coding genes can emerge de novo from previously non-coding DNA. Most studies have focused on large-scale computational predictions of de novo protein-coding genes across a wide range of organisms. In contrast, experimental data concerning the folding and function of de novo proteins are scarce. This might be due to difficulties in handling de novo proteins in vitro, as most are short and predicted to be disordered. Here, we propose a guideline for the effective expression of eukaryotic de novo proteins in Escherichia coli. For heterologous expression, we used 11 sequences from Drosophila melanogaster and 10 from Homo sapiens that were predicted to be de novo proteins in previous studies. The candidate de novo proteins have varying secondary structure and disorder content. Using multiple combinations of purification tags, E. coli expression strains, and chaperone systems, we were able to increase the proportion of solubly expressed putative de novo proteins from 30% to 62%. Our findings indicate that the best combination for expressing putative de novo proteins in E. coli is a GST-tag with T7 Express cells and co-expressed chaperones. We found that, overall, proteins with higher predicted disorder were easier to express. Statement: Today, we know that proteins not only evolve by duplication and divergence of existing proteins but also arise from previously non-coding DNA. These proteins are called de novo proteins. Their properties are still poorly understood, and their experimental analysis faces major obstacles. Here, we aim to present a starting point for the soluble expression of de novo proteins with the help of chaperones and thereby enable further characterization.
De novo genes are novel genes which emerge from non-coding DNA. So far, little is known about the properties of de novo genes and how these correlate with their age and mechanisms of emergence. In this study, we investigate four related properties in 23,135 human proto-genes: introns, upstream regulatory motifs, 5′ untranslated regions (UTRs), and protein domains. We found that proto-genes contain introns, whose number and position correlate with the genomic position of proto-gene emergence. The origin of these introns is debated, as our results suggest that 41% of proto-genes might have captured existing introns, and 13.7% of them do not splice the ORF. We show that proto-genes which emerged via overprinting tend to be more enriched in core promoter motifs, while intergenic and intronic genes are more enriched in enhancers, even though the TATA motif is the one most commonly found upstream of these genes. Intergenic and intronic 5′ UTRs of proto-genes have a lower potential to stabilise mRNA structures than those of exonic proto-genes and established human genes. Finally, we confirm that proteins expressed by proto-genes gain new putative domains with age. Overall, we find that regulatory motifs inducing transcription and translation of previously non-coding sequences may facilitate proto-gene emergence. Our study demonstrates that introns, 5′ UTRs, and domains have specific properties in proto-genes. We also emphasize that the genomic positions of de novo genes strongly impact these properties.
De novo genes are novel genes which emerge from non-coding DNA. So far, little is known about the properties of de novo genes and how these correlate with their age and mechanisms of emergence. In this study, we investigate four properties in 23,135 human proto-genes: introns, upstream regulatory motifs, 5′ UTRs, and protein domains. We found that proto-genes contain introns, whose number and position correlate with the genomic position of proto-gene emergence. The origin of these introns is debated, as our results suggest that 41% of proto-genes might have captured existing introns and that 13.7% of them do not splice the ORF. We show that proto-genes which emerged via overprinting tend to be more enriched in core promoter motifs, while intergenic and intronic ones are more enriched in enhancers, even though the TATA motif is the one most commonly found upstream of these genes. Intergenic and intronic 5′ UTRs of proto-genes have a lower potential to stabilise mRNA structures than those of exonic proto-genes and established human genes. Finally, we confirm that proto-genes gain new putative domains with age. Overall, we find that regulatory motifs inducing transcription and translation of previously non-coding sequences may facilitate proto-gene emergence. Our paper demonstrates that introns, 5′ UTRs, and domains have specific properties in proto-genes. We also show the importance of studying proto-genes in relation to their genomic position, as it strongly impacts these properties.
Soil salinity, and the resulting salt stress it imposes on crop plants, is a major problem for modern agriculture. Understanding how salt tolerance mechanisms in plants are regulated is therefore important. One regulatory mechanism involves the APETALA2/Ethylene Responsive Factor (AP2/ERF) transcription factor family, which includes the dehydration responsive element binding (DREB) transcription factors. By binding to DNA, specifically upstream of genes that play roles in salt tolerance pathways, DREB proteins upregulate expression of these genes. DREB proteins in Triticum aestivum (wheat) cluster in subgroups. In this study, we increased the number of members of this subfamily by scanning the recently extended predicted proteome of wheat for DREB proteins. Using the wheat genome, we identified 576 genes coding for the AP2 domain, of which 508 have a single AP2 domain, a characteristic of the DREB/ERF subfamily. We confirmed the four existing subgroups by sequence-based phylogenetic analyses, but also identified 32 new DREB subfamily members that do not belong to any known subgroup. Transcription factor profile inference analysis identified two genes, TraesCS2B02G002700 and TraesCS2D02G015200, as homologous to DREB1A of Arabidopsis thaliana. Based on molecular simulation (25 ns) analysis, TraesCS2B02G002700 with a CCGAC motif was observed to interact very stably with DNA. In silico mutational analysis at the 19th position of the DREB domain in the TraesCS2B02G002700-DNA complex indicated that this position is a stable part for recognizing and interacting with DNA. Moreover, six target genes were predicted to have an upstream CCGAC motif and to be regulated by TraesCS2B02G002700. Our study provides an overall framework for exploring transcription factors in plants and identifying, for example, potential salt stress target genes.
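The prediction of target genes with an upstream CCGAC motif can be illustrated with a minimal Python sketch. It is not the pipeline used in the study: the function names, the gene identifiers `geneA` and `geneB`, and the toy sequences are hypothetical, and the sketch assumes that upstream regions are already available as plain DNA strings and that a hit on either strand counts.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T, N only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_motif(seq: str, motif: str = "CCGAC") -> list[tuple[int, str]]:
    """List (0-based start, strand) for every exact motif match in seq.

    "+" marks a match to the motif itself, "-" a match to its reverse
    complement, i.e. the motif located on the opposite strand.
    """
    seq = seq.upper()
    motif = motif.upper()
    rc_motif = reverse_complement(motif)
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        window = seq[start:start + len(motif)]
        if window == motif:
            hits.append((start, "+"))
        elif window == rc_motif:
            hits.append((start, "-"))
    return hits


def genes_with_upstream_motif(upstream: dict[str, str],
                              motif: str = "CCGAC") -> dict[str, list[tuple[int, str]]]:
    """Keep only the genes whose upstream region contains at least one hit."""
    result = {}
    for gene, seq in upstream.items():
        hits = find_motif(seq, motif)
        if hits:
            result[gene] = hits
    return result


if __name__ == "__main__":
    # Toy upstream regions with hypothetical gene names, for illustration only.
    toy_upstream = {"geneA": "AACCGACTT", "geneB": "AAAAAAAAA"}
    print(genes_with_upstream_motif(toy_upstream))
```

An exact string match like this only flags candidates; it says nothing about whether a transcription factor actually binds or regulates the gene.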
Over the past decade, evidence has accumulated that new protein-coding genes can emerge de novo from previously non-coding DNA. Most studies have focused on large-scale computational predictions of de novo protein-coding genes across a wide range of organisms. In contrast, experimental data concerning the folding and function of de novo proteins are scarce. This might be due to difficulties in handling de novo proteins in vitro, as most are predicted to be short and disordered. Here, we propose a guideline for the effective expression of eukaryotic de novo proteins in Escherichia coli. For heterologous expression, we used 11 sequences from Drosophila melanogaster and 10 from Homo sapiens that were predicted to be de novo proteins in previous studies. The candidate de novo proteins have varying secondary structure and disorder content. Using multiple combinations of purification tags, E. coli expression strains, and chaperone systems, we were able to increase the proportion of solubly expressed putative de novo proteins from 30% to 62%. Our findings indicate that the best combination for expressing putative de novo proteins in E. coli is a GST-tag with T7 Express cells and co-expressed chaperones. We found that, overall, proteins with higher predicted disorder were easier to express.