Sequence weighting techniques are aimed at balancing redundant observed information from subsets of similar sequences in multiple alignments. Traditional approaches apply the same weight to all positions of a given sequence, hence equal efficiency of phylogenetic changes is assumed along the whole sequence. This restrictive assumption is not required for the new method PSIC (position-specific independent counts) described in this paper. The number of independent observations (counts) of an amino acid type at a given alignment position is calculated from the overall similarity of the sequences that share the amino acid type at this position with the help of statistical concepts. This approach allows the fast computation of position-specific sequence weights even for alignments containing hundreds of sequences. The PSIC approach has been applied to profile extraction and to the fold family assignment of protein sequences with known structures. Our method was shown to be very productive in finding distantly related sequences and more powerful than Hidden Markov Models or the profile methods in WiseTools and PSI-BLAST in many cases. The profile extraction routine is available on the WWW (http://www.bork.embl-heidelberg. de/PSIC or http://www.imb.ac.ru/PSIC).
We calculated interchain contacts on the atomic level for nonredundant set of 4602 protein-protein interfaces using an unbiased Voronoi-Delaune tessellation method, and made 20x20 residue contact matrixes both for homodimers and heterocomplexes. The area of contacts and the distance distribution for these contacts were calculated on both the residue and the atomic levels. We analyzed residue area distribution and showed the existence of two types of interresidue contacts: stochastic and specific. We also derived formulas describing the distribution of contact area for stochastic and specific interactions in parametric form. Maximum pairing preference index was found for Cys-Cys contacts and for oppositely charged interactions. A significant difference in residue contacts was observed between homodimers and heterocomplexes. Interfaces in homodimers were enriched with contacts between residues of the same type due to the effects of structure symmetry.
Supplementary data are available at Bioinformatics online.
The parametric description of residue environments through solvent accessibility, backbone conformation, or pairwise residue-residue distances is the key to the comparison between amino acid types at protein sequence positions and residue locations in structural templates (condition of protein sequence-structure match). For the first time, the research results presented in this study clarify and allow to quantify, on a rigorous statistical basis, to what extent the amino acid type-specific distributions of commonly used environment parameters are discriminative with respect to the 20 amino acid types. Relying on the Bahadur theory, we estimate the probability of error in a single-sequence-structure alignment based on weak or absent discriminative power in a learning database of protein structure. We present the results for many residue environment variables and demonstrate that each fold description parameter is sensitive with respect to only a few amino acid types while indifferent to most of the other amino acid types. Even complex structural characteristics combining solvent-accessible surface area, backbone conformation, and pairwise distances distinguish only some amino acid types, whereas the others remain nondiscriminated. We find that the knowledge-based potentials currently in use treat especially Ala, Asp, Gln, His, Ser, Thr, and Tyr as essentially "average" amino acids. Thus, highly discriminative amino acid types define the alignment register in gapless sequence-structure alignments. The introduction of gaps leads to alignment ambiguities at sequence positions occupied by nondiscriminated amino acid types. Therefore, local sequence-structure alignments produced by techniques with gaps cannot be reliable. Conceptionally new and more sensitive environment parameters must be invented.
The assignment of query protein sequences to probable folds in a threading approach is based on the statistical analysis (learning) of structural properties of amino acids in known protein structures. We formalize the recognition problem in terms of mathematical statistics, namely statistical hypothesis testing. Our general formulation leads to various mathematical forms of a decision rule function for evaluation of the quality of a sequence-structure fit. Three criteria were derived according to a likelihood ratio approach. Two of them have new functional forms while the third happens to coincide with the mean force potential function previously derived under the additional assumption of the Boltzmann law. New decision rule functions employ (i) the Parzen estimator of a probability density and (ii) the newly introduced non-parametric statistic with known asymptotic distribution. We compared criteria efficiency by a 'structure seeks sequence' search for three highly populated template folds through a query library of non-homologous sequences of proteins with known 3D structure using residue accessibility as an environmental variable. Various criteria reflect different underlying statistical propositions and thus often recognize diverse correct sequence-structure matches. On the other hand, if an amino acid sequence is recognized as compatible with a template by each of three decision rules it appears that one can make a more reliable inference of sequence-structure relationship since almost all false positives obtained by the three criteria differ.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.