ABSTRACT

Predicted relative solvent accessibility (RSA) provides useful information for prediction of binding sites and reconstruction of the 3D-structure based on a protein sequence. Recent years have seen the development of several RSA prediction methods, including those that generate real values and those that predict discrete states (buried vs. exposed). We propose a novel method for real-value prediction that aims at minimizing the prediction error when compared with six existing methods. The proposed method is based on a two-stage Support Vector Regression (SVR) predictor. The improved prediction quality is a result of the developed composite sequence representation, which includes a custom-selected subset of features from the PSI-BLAST profile, secondary structure predicted with PSI-PRED, and a binary code that indicates the position of a given residue with respect to the sequence termini. Cross-validation tests on a benchmark dataset show that our method achieves 14.3 mean absolute error and 0.68 correlation. We also propose a confidence value that is associated with each predicted RSA value. The confidence is computed based on the difference in predictions from the two-stage SVR and a second two-stage Linear Regression (LR) predictor. The confidence values can be used to indicate [...]

[...]ing gap between the number of known protein sequences and the number of known structures. Despite several decades of extensive research in tertiary structure prediction, this task is still a big challenge, especially for sequences that do not have a significant sequence similarity with known structures [1]. As a result, the predictions of the solvent accessibility [2] and the secondary structure [3] are addressed as an intermediate step towards the prediction of the tertiary structure. The relative solvent accessibility (RSA) reflects the degree to which a residue interacts with the solvent molecules. Since protein-protein and protein-ligand interactions occur at the protein surface, only the residues that have a large surface area exposed to the solvent can possibly bind to the ligands and other proteins. As a result, prediction of solvent accessibility provides useful information for prediction of binding sites [4] and is vitally important for understanding the binding mechanism of proteins [5]. Chan and Dill pointed out that the burial of core residues is the driving force in protein folding, which suggests that knowledge of the localization of individual residues (surface vs. buried) provides useful information to reconstruct the 3D-structure of proteins [6][7][8].

The existing solvent accessibility prediction methods use the protein sequence, which is converted into a fixed-size feature-based representation, as an input to predict the RSA for each of the residues. These methods can be divided into two main groups:

Real valued predictors predict RSA value (the definition is given in the Materials section