Machine learning methods are often used to classify objects described by hundreds of attributes; in many applications of this kind a large fraction of the attributes may be totally irrelevant to the classification problem. Moreover, one usually cannot decide a priori which attributes are relevant. In this paper we present an improved version of an algorithm for identifying the full set of truly important variables in an information system. It is an extension of the random forest method which utilises the importance measure generated by the original algorithm. It iteratively compares the importances of the original attributes with the importances of their randomised copies. We analyse the performance of the algorithm on several examples of synthetic data, as well as on a biologically important problem, namely the identification of sequence motifs that are important for the aptameric activity of short RNA sequences.
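The comparison between original attributes and their randomised copies can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract specifies random forest importance, whereas the sketch below swaps in absolute Pearson correlation as a stand-in importance so that it runs with the standard library alone. The names `abs_correlation`, `shadow_hits`, `columns` and `labels`, and the toy data, are hypothetical.

```python
import random

def abs_correlation(xs, ys):
    """Stand-in importance score: absolute Pearson correlation.
    ASSUMPTION: the abstract uses random forest importance, not this."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return abs(cov) / (var_x * var_y) ** 0.5

def shadow_hits(columns, labels, n_iter=50, seed=0):
    """Count, per attribute, how often its importance exceeds the best
    importance among shuffled (randomised) copies of all attributes."""
    rng = random.Random(seed)
    hits = {name: 0 for name in columns}
    for _ in range(n_iter):
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]   # randomised copy of one attribute
            rng.shuffle(shuffled)  # breaks any link to the labels
            shadow_scores.append(abs_correlation(shuffled, labels))
        threshold = max(shadow_scores)
        for name, values in columns.items():
            if abs_correlation(values, labels) > threshold:
                hits[name] += 1
    return hits

# Hypothetical toy data: one informative attribute, one irrelevant one.
labels = [i % 2 for i in range(200)]
columns = {
    "signal": [y + 0.1 * (i % 3) for i, y in enumerate(labels)],
    "noise": [(i // 2) % 2 for i in range(200)],  # uncorrelated with labels
}
hits = shadow_hits(columns, labels, n_iter=50)
```

An attribute that beats the best randomised copy in nearly every iteration is a candidate for the set of relevant variables; how the hit counts are turned into accept or reject decisions is not described in the abstract and is left out here.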
Chromatin topology is intricately linked to gene expression, yet its functional requirement remains unclear. Here, we comprehensively assessed the interplay between genome topology and gene expression using highly rearranged chromosomes (balancers) spanning ~75% of the Drosophila genome. Using trans-heterozygous (balancer/wild-type) embryos, we measured allele-specific changes in topology and gene expression in cis, whilst minimizing trans effects. Through genome sequencing, we resolved eight large nested inversions, smaller inversions, duplications, and thousands of deletions. These extensive rearrangements caused many changes to chromatin topology, including long-range loops, TADs and promoter interactions, yet these are not predictive of changes in expression. Gene expression is generally not altered around inversion breakpoints, indicating that inappropriate enhancer-promoter activation is a rare event. Similarly, shuffling or fusing TADs, changing intra-TAD connections and disrupting long-range inter-TAD loops do not alter expression for the majority of genes. Our results suggest that properties other than chromatin topology ensure productive enhancer-promoter interactions.
The SOXE transcription factors SOX8, SOX9 and SOX10 are master regulators of mammalian development directing sex determination, gliogenesis, pancreas specification and neural crest development. We identified a set of palindromic SOX binding sites specifically enriched in regulatory regions of melanoma cells. SOXE proteins homodimerize on these sequences with high cooperativity. In contrast to other transcription factor dimers, which are typically rigidly spaced, SOXE group proteins can bind cooperatively at a wide range of dimer spacings. Using truncated forms of SOXE proteins, we show that a single dimerization (DIM) domain, which precedes the DNA-binding high mobility group (HMG) domain, is sufficient for dimer formation, suggesting that DIM:HMG rather than DIM:DIM interactions mediate the dimerization. All SOXE members can also heterodimerize in this fashion, whereas SOXE heterodimers with SOX2, SOX4, SOX6 and SOX18 are not supported. We propose a structural model where SOXE-specific intramolecular DIM:HMG interactions are allosterically communicated to the HMG of juxtaposed molecules. Collectively, SOXE factors evolved a unique mode to combinatorially regulate their target genes that relies on a multifaceted interplay between the HMG and DIM domains. This property potentially further extends the diversity of target genes and cell-specific functions that are regulated by SOXE proteins.
The binding of transcription factors (TFs) to their specific motifs in genomic regulatory regions is commonly studied in isolation. However, in order to elucidate the mechanisms of transcriptional regulation, it is essential to determine which TFs bind DNA cooperatively as dimers and to infer the precise nature of these interactions. So far, only a small number of such dimeric complexes are known. Here, we present an algorithm for predicting cell-type-specific TF-TF dimerization on DNA on a large scale, using DNase I hypersensitivity data from 78 human cell lines. We represented the universe of possible TF complexes by their corresponding motif complexes, and analyzed their occurrence at cell-type-specific DNase I hypersensitive sites. Based on ~1.4 billion tests for motif complex enrichment, we predicted 603 highly significant cell-type-specific TF dimers, the vast majority of which are novel. Our predictions included 76% (19/25) of the known dimeric complexes and showed significant overlap with an experimental database of protein-protein interactions. They were also independently supported by evolutionary conservation, as well as quantitative variation in DNase I digestion patterns. Notably, the known and predicted TF dimers were almost always highly compact and rigidly spaced, suggesting that TFs dimerize in close proximity to their partners, which results in strict constraints on the structure of the DNA-bound complex. Overall, our results indicate that chromatin openness profiles are highly predictive of cell-type-specific TF-TF interactions. Moreover, cooperative TF dimerization seems to be a widespread phenomenon, with multiple TF complexes predicted in most cell types.
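To make the idea of a motif complex enrichment test concrete, here is a minimal sketch of one such test. The abstract does not say which statistical test or multiple-testing correction was used, so the one-sided hypergeometric test and the Bonferroni correction below are assumptions, and all function names and counts are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(hits_in_set, set_size, hits_total, population):
    """One-sided hypergeometric p-value P(X >= hits_in_set).

    population:  all hypersensitive sites considered
    hits_total:  sites among them that contain the motif complex
    set_size:    sites specific to one cell type
    hits_in_set: cell-type-specific sites that contain the motif complex
    ASSUMPTION: the test used in the paper is not given in the abstract.
    """
    upper = min(set_size, hits_total)
    tail = sum(
        comb(hits_total, k) * comb(population - hits_total, set_size - k)
        for k in range(hits_in_set, upper + 1)
    )
    return tail / comb(population, set_size)

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1.0 (an assumed correction)."""
    return min(1.0, p_value * n_tests)

# Hypothetical counts: 20 of 100 cell-type-specific sites carry the motif
# complex, against 50 of 1000 sites overall.
p = hypergeom_enrichment_p(20, 100, 50, 1000)
p_adjusted = bonferroni(p, n_tests=10_000)
```

With very many tests, as in the ~1.4 billion reported above, an individual p-value has to be extremely small to survive a correction of this kind, which is one reason a conservative correction might be replaced by a less strict procedure in practice.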