Summary: The sequence-specific recognition of DNA by regulatory proteins typically occurs through hydrogen bonds and non-bonded contacts between chemical sub-structures of nucleotides and amino acids, which together form compatible interacting surfaces. The recognition process is also influenced by the physicochemical and conformational character of the target oligonucleotide motif. Although the role of these mechanisms in DNA-protein interactions is well established, bioinformatics methods rarely address them directly; instead, binding specificity is mostly assessed at the nucleotide level. DNA Readout Viewer (DRV) aims to provide a novel DNA representation that gives an in-depth view of these mechanisms by concurrently visualizing functional groups and a diverse collection of DNA descriptors. Applied to various visualization tasks related to DNA recognition, its intuitive representation concept can help unravel the factors that determine the binding specificity of DNA-protein interactions.
Availability and implementation: DRV is freely available at https://drv.brc.hu.
Supplementary information: Supplementary data are available at Bioinformatics online.
Transcription factors (TFs) play an essential role in molecular biology by regulating gene expression. TF binding sites vary widely, and the large number of possible binding locations makes their detection challenging. Recently, several machine learning approaches using nucleotide sequence data have been applied to classify DNA sequences with respect to Transcription Factor Binding Sites (TFBS). We propose a novel training strategy that replaces the traditional 1D nucleotide-based DNA sequence representation with a 2D topological matrix of sub-nucleotide chemical functional groups, which substantially define the protein binding ability of DNA fragments. We train convolutional neural networks on this novel Functional Group DNA Representation (FGDR) to solve a TFBS classification task. We compare our results with the performance of previous nucleotide-based training approaches and show that learning from FGDR data has several benefits for TFBS classification. Moreover, we argue that training deep neural networks on the FGDR representation produces competitive results while introducing only a pre-processing conversion step. Finally, we show that an ensemble of models trained on the nucleotide and FGDR representations achieves higher classification performance than either single-input approach.
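The contrast between a 1D nucleotide encoding and a 2D functional-group matrix can be illustrated with a minimal sketch. It is not the FGDR defined in the paper, whose exact layout is not given here. As an assumption, it uses the textbook major-groove contact pattern of each base pair (acceptor, donor, methyl, nonpolar hydrogen) as a stand-in. The function names, the integer codes, and the plain averaging used for the ensemble are all hypothetical choices made for illustration.

```python
# Illustrative stand-in for a functional-group encoding (NOT the paper's FGDR).
# Textbook major-groove contact patterns, read across each base pair:
# A = H-bond acceptor, D = H-bond donor, M = methyl group, H = nonpolar hydrogen.
MAJOR_GROOVE = {
    "A": "ADAM",  # A.T pair
    "T": "MADA",  # T.A pair
    "G": "AADH",  # G.C pair
    "C": "HDAA",  # C.G pair
}

# Hypothetical integer codes for the four group types.
GROUP_CODES = {"A": 0, "D": 1, "M": 2, "H": 3}


def one_hot_nucleotides(seq):
    """Traditional 1D representation: one 4-element one-hot vector per base."""
    order = "ACGT"
    return [[1 if base == b else 0 for b in order] for base in seq.upper()]


def functional_group_matrix(seq):
    """Toy 2D representation: rows are groove positions, columns are bases."""
    patterns = [MAJOR_GROOVE[base] for base in seq.upper()]
    return [[GROUP_CODES[p[row]] for p in patterns] for row in range(4)]


def ensemble_probability(p_nucleotide, p_functional_group):
    """Assumed ensemble rule: average the two models' binding probabilities."""
    return (p_nucleotide + p_functional_group) / 2


if __name__ == "__main__":
    for row in functional_group_matrix("GATTACA"):
        print(row)
```

In this sketch the conversion is a pure lookup, which matches the abstract's point that the functional-group input adds only a pre-processing step before network training.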
Transcription Factors (TFs) are among the most important agents of gene expression regulation, fundamentally determining the organized functional operation of cellular machinery. At the molecular level, this effect is achieved by the sequence-specific physical binding of TF proteins to particular parts of the DNA. TFs regulate gene expression in complex ways, and detecting their binding sites is an important part of many experiments. Predicting Transcription Factor Binding Sites (TFBS) from DNA sequence data has been a challenging task in bioinformatics. The abundance of available DNA sequences strongly encourages the use of machine learning for this problem. Until now, most of these efforts have been based primarily on the traditional nucleotide-based representation of DNA. To describe this macromolecule in more detail, we have developed a new Physico-Chemical Descriptor (PCD) based DNA representation and used it as input for training neural networks to predict TFBSs. We show that the PCD representation is a viable format for deep learning models, and our feature selection investigation highlights the importance of proper PCD subset choices. The distinct prediction efficiencies observed with arbitrarily selected feature subsets indicate that different DNA features affect the DNA binding process of TFs to varying extents.
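The idea of a descriptor-based input with selectable feature subsets can be sketched as follows. This is a minimal illustration, not the PCD set used in the paper: the three per-base descriptors below are assumptions chosen because their values are textbook facts (Watson-Crick hydrogen bond count, purine versus pyrimidine, amino versus keto), and the names `DESCRIPTORS` and `descriptor_matrix` are hypothetical.

```python
# Illustrative per-base descriptors (NOT the paper's PCD set).
DESCRIPTORS = {
    # Number of Watson-Crick hydrogen bonds in the base pair.
    "wc_hbonds": {"A": 2, "C": 3, "G": 3, "T": 2},
    # 1 for purines (A, G), 0 for pyrimidines (C, T).
    "is_purine": {"A": 1, "C": 0, "G": 1, "T": 0},
    # 1 for amino bases (A, C), 0 for keto bases (G, T).
    "is_amino": {"A": 1, "C": 1, "G": 0, "T": 0},
}


def descriptor_matrix(seq, features):
    """Build one row per selected descriptor and one column per base."""
    return [[DESCRIPTORS[name][base] for base in seq.upper()] for name in features]


if __name__ == "__main__":
    # Two different feature subsets yield two different network inputs.
    print(descriptor_matrix("ACGT", ["wc_hbonds"]))
    print(descriptor_matrix("ACGT", ["is_purine", "is_amino"]))
```

Passing the feature list as an argument makes it straightforward to train one model per subset and compare their prediction performance, which is the kind of comparison the abstract describes.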