Photosynthesis, by which plants convert carbon dioxide to sugars using the energy of light, is fundamental to life as it forms the basis of nearly all food chains. Surprisingly, our knowledge about its transcriptional regulation remains incomplete. Efforts toward its agricultural optimization have mostly focused on post-translational regulatory processes1-3, but photosynthesis is also regulated at the post-transcriptional4 and the transcriptional level5. Stacked transcription factor mutants remain photosynthetically active5,6, and additional transcription factors have been difficult to identify, possibly due to redundancy6 or lethality. Using a random forest decision tree-based machine learning approach for gene regulatory network calculation7, we determined ranked candidate transcription factors and validated five out of five tested transcription factors as controlling photosynthesis in vivo. The detailed analyses of previously published and newly identified transcription factors suggest that photosynthesis is transcriptionally regulated in a partitioned, non-hierarchical, interlooped network.
Understanding gene expression will require understanding where regulatory factors bind genomic DNA. The frequently used sequence-based motifs of protein-DNA binding are not predictive, since a genome contains many more binding sites than are actually bound and transcription factors of the same family share similar DNA-binding motifs. Traditionally, these motifs depict only sequence and neglect DNA shape. Since shape may contribute non-linearly and combinatorially to binding, machine learning approaches ought to be able to better predict transcription factor binding. Here we show that a random forest machine learning approach, which incorporates the 3D shape of DNA, enhances binding prediction for all 216 tested Arabidopsis thaliana transcription factors and improves the resolution of differential binding by transcription factor family members which share the same binding motif. We observed that DNA shape features were individually weighted for each transcription factor, even if they shared the same binding sequence.
Cardiovascular diseases are the number one cause of morbidity and mortality worldwide, but the underlying molecular mechanisms remain poorly understood. Cardiomyopathies are primary diseases of the heart muscle and contribute to high rates of heart failure and sudden cardiac death. Here, we distinguished four different genetic cardiomyopathies based on gene expression signatures. In this study, RNA sequencing was used to identify gene expression signatures in myocardial tissue of cardiomyopathy patients in comparison to non-failing human hearts. To this end, expression differences between patients with specific affected genes, namely LMNA (lamin A/C), RBM20 (RNA binding motif protein 20), TTN (titin) and PKP2 (plakophilin 2), were investigated. We identified genotype-specific differences in regulated pathways, Gene Ontology (GO) terms, gene groups such as secreted or regulatory proteins, and potential candidate drug targets, revealing specific molecular pathomechanisms for the four subtypes of genetic cardiomyopathies. Some regulated pathways are common between patients with mutations in RBM20 and TTN, as the splice factor RBM20 targets TTN among other genes, leading to a similar response at the pathway level, even though many differentially expressed genes (DEGs) still differ between both sample types. The myocardium of patients with mutations in LMNA is widely associated with upregulated genes and pathways involved in immune response, whereas mutations in PKP2 lead to a downregulation of genes of the extracellular matrix. Our results contribute to further understanding of the underlying molecular pathomechanisms, with the aim of novel and better treatment of genetic cardiomyopathies.
A genome encodes two types of information, the "what can be made" and the "when and where". The "what" are mostly proteins, which perform the majority of functions within living organisms, and the "when and where" is the regulatory information that encodes when and where proteins are made. Currently, it is possible to efficiently predict the majority of the protein content of a genome but nearly impossible to predict the transcriptional regulation. This regulation is based upon the interaction between transcription factors and genomic sequences at the site of binding motifs1,2,3. Information contained within the motif is necessary to predict transcription factor binding; however, it is not sufficient4. Peaks detected in amplified DNA affinity purification sequencing (ampDAP-seq) and the motifs derived from them only partially overlap in the genome3, indicating that the sequence holds information beyond the binding motif. Here we show that a random forest machine learning approach which incorporates the 3D shape of DNA improved the area under the precision-recall curve for binding prediction for all 216 tested Arabidopsis thaliana transcription factors. The method resolved differential binding of transcription factor family members which share the same binding motif. The models correctly predicted the binding behavior of novel, not-in-genome motif sequences. Understanding transcription factor binding as a combination of motif sequence and motif shape brings us closer to predicting gene expression from promoter sequence.
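As a rough illustration of the evaluation metric named in this abstract, the sketch below computes the area under the precision-recall curve in its step-wise form (average precision) for two sets of predicted binding scores. It is not the authors' code: the function name `average_precision`, the labels, and both score lists are made-up toy values chosen only to show how a better ranking of bound sites raises the metric.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_positives = sum(labels)
    if total_positives == 0:
        raise ValueError("need at least one bound (positive) site")
    # Rank candidate sites from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Each recovered positive contributes the precision at its rank
            # multiplied by the recall step of 1 / total_positives.
            area += (true_positives / rank) / total_positives
    return area


# Hypothetical toy data: 1 = site is bound, 0 = motif present but not bound.
labels = [1, 1, 0, 1, 0, 0]
motif_only_scores = [0.9, 0.4, 0.8, 0.3, 0.7, 0.1]        # assumed values
motif_plus_shape_scores = [0.9, 0.8, 0.3, 0.7, 0.4, 0.1]  # assumed values

ap_motif = average_precision(labels, motif_only_scores)
ap_shape = average_precision(labels, motif_plus_shape_scores)
print(f"motif only: {ap_motif:.3f}, motif + shape: {ap_shape:.3f}")
```

The precision-recall curve suits this setting because, as the abstract notes, a genome contains many more motif occurrences than bound sites, so the positive class is rare.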
Identification of plasmids from sequencing data is an important and challenging problem related to antimicrobial resistance spread and other One-Health issues. In our work, we provide a new architecture for identifying plasmid contigs in fragmented genome assemblies built from short-read data. Unlike previous machine-learning approaches for this problem, which classify individual contigs separately, we employ graph neural networks (GNNs) to include information from the assembly graph. Propagation of information from nearby nodes in the graph allows accurate classification of even short contigs that are difficult to classify based on sequence features or database searches alone. Our new species-agnostic software tool plASgraph outperforms the recently developed PlasForest, which uses database searches to supplement sequence-based features. Since our tool does not rely on existing plasmid databases, it is more suitable for classification of contigs in novel species and discovery of previously unknown plasmid sequences. Our tool can also be trained on a specific species, and in that scenario it outperforms mlplasmids trained on the same species. On the one hand, our work provides a new, accurate, and easy-to-use tool for plasmid classification; on the other hand, it serves as a motivation for more widespread use of GNNs in bioinformatics, such as in pangenome sequence analysis, where sequence graphs serve as a fundamental data structure.
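The idea of propagating information from nearby nodes can be sketched with a single unweighted neighbour-averaging step on a toy assembly graph. This is a simplification for illustration only and not the plASgraph architecture: a real GNN layer uses learned weights and non-linearities, and the function `propagate`, the `self_weight` parameter, the contig names, and all scores below are assumptions.

```python
def propagate(scores, edges, self_weight=0.5, rounds=1):
    """Blend each contig's score with the mean score of its graph neighbours."""
    # Build an undirected adjacency map from the edge list.
    neighbours = {node: set() for node in scores}
    for a, b in edges:
        neighbours[a].add(b)
        neighbours[b].add(a)

    current = dict(scores)
    for _ in range(rounds):
        updated = {}
        for node, own in current.items():
            linked = neighbours[node]
            if linked:
                mean_linked = sum(current[n] for n in linked) / len(linked)
                updated[node] = self_weight * own + (1 - self_weight) * mean_linked
            else:
                # Isolated contigs keep their own score unchanged.
                updated[node] = own
        current = updated
    return current


# Hypothetical per-contig plasmid scores from sequence features alone.
# "c2" stands for a short contig whose own features are uninformative (0.5).
initial_scores = {"c1": 0.9, "c2": 0.5, "c3": 0.8, "c4": 0.1}
assembly_edges = [("c1", "c2"), ("c2", "c3")]  # assumed toy assembly graph

result = propagate(initial_scores, assembly_edges)
print(result)
```

In this toy graph the ambiguous contig `c2` is pulled toward the high scores of its two neighbours, while the unconnected contig `c4` is left as it was.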