Motivation: Rapid and accurate identification of transmembrane (TM) topology is well suited for the annotation of the entire membrane proteome, and it is the initial step in predicting the structure and function of membrane proteins. However, existing methods that use only amino acid sequence information suffer from low prediction accuracy, whereas methods that exploit sequence profiles or consensus require too much computational time.
Method: Here we propose a deep learning framework, DeepCNF, that predicts TM topology from amino acid sequence only. Compared to previous sequence-based approaches that use hidden Markov models or dynamic Bayesian networks, DeepCNF can incorporate much more contextual information through a hierarchical deep neural network, while simultaneously modeling the interdependency between adjacent topology labels.
Result: Experimental results show that PureseqTM not only outperforms existing sequence-based methods, but also matches or even surpasses the profile/consensus methods. On the 39 newly released membrane proteins, our approach successfully identifies the correct TM segments and boundaries for at least 3 cases in which all existing methods fail to do so. When applied to the entire human proteome, our method can identify incorrect annotations of TM regions in UniProt and discover membrane-related proteins that are not manually curated as membrane proteins.

Availability: http://pureseqtm.predmp.com/

===============================================================================

===========
Introduction:
===========

Transmembrane proteins (TMPs) are key players in energy production, material transport, and communication between cells [1]. TMPs are encoded by ~30% of the genes in various genomes [2] and have been targeted by ~50% of therapeutic drugs [3]. Despite their abundance and importance, the number of solved TMP structures is relatively low compared to that of non-transmembrane proteins (non-TMPs). In particular, under the 40% sequence identity cutoff, there are only about 1500 non-redundant TMPs, whereas the number of non-redundant non-TMPs is more than 34000. The underlying reason is that the experimental determination of TMP structures is challenging, as membrane proteins are often too large for NMR spectroscopy and difficult to crystallize for X-ray crystallography [4]. Thus, it is critical to develop computational methods for the prediction of TMP structures from amino acid sequences, and the initial step is the accurate identification of the transmembrane topology [5].

As shown in the left part of Figure 1, transmembrane (TM) topology refers to the locations of the membrane-spanning segments, which can be represented as a 1D 0/1 string that indicates whether each residue resides in (label 1) or out of (label 0) the membrane.
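The 0/1 encoding described above can be illustrated with a short Python sketch. It is not code from PureseqTM: the function names `segments_to_labels` and `labels_to_segments`, the 1-based inclusive residue numbering, and the example segment positions are all assumptions made for illustration.

```python
from typing import List, Tuple


def segments_to_labels(length: int, segments: List[Tuple[int, int]]) -> str:
    """Encode TM segments as a 0/1 string of the given protein length.

    Assumption: each segment is a (start, end) pair using 1-based,
    inclusive residue positions.
    """
    labels = ["0"] * length  # label 0: residue is outside the membrane
    for start, end in segments:
        if not (1 <= start <= end <= length):
            raise ValueError(f"segment ({start}, {end}) is out of range")
        for i in range(start - 1, end):
            labels[i] = "1"  # label 1: residue is inside the membrane
    return "".join(labels)


def labels_to_segments(labels: str) -> List[Tuple[int, int]]:
    """Recover (start, end) TM segments from a 0/1 label string."""
    segments: List[Tuple[int, int]] = []
    start = None
    for pos, label in enumerate(labels, start=1):
        if label == "1" and start is None:
            start = pos  # a new TM segment opens here
        elif label != "1" and start is not None:
            segments.append((start, pos - 1))  # segment closed at previous residue
            start = None
    if start is not None:
        # the string ends inside a TM segment
        segments.append((start, len(labels)))
    return segments


if __name__ == "__main__":
    # Hypothetical 10-residue protein with two made-up TM segments.
    topology = segments_to_labels(10, [(3, 5), (8, 9)])
    print(topology)
    print(labels_to_segments(topology))
```

Converting in both directions makes it easy to compare a predicted label string against annotated segment boundaries, whichever form the annotation is stored in.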
This simple but direct definition of TM topology is consistent with the 3-label definition used by many other works that divide non-TM regions (i.e., label 0) into inner or outer classes [6][7][8][9][10]. In this work, we only focus on the prediction of TM topol...