Protein structure prediction without templates (i.e., ab initio folding) is one of the most challenging problems in structural biology, and conformation sampling is a major bottleneck in ab initio folding. This article presents CRFSampler, an extensible protein conformation sampler built on conditional random fields (CRFs), a probabilistic graphical model. Using a discriminative learning method, CRFSampler automatically learns more than ten thousand parameters quantifying the relationship among primary sequence, secondary structure, and (pseudo) backbone angles. Using only compactness and self-avoidance constraints, CRFSampler efficiently generates protein-like conformations from a primary sequence and its predicted secondary structure. CRFSampler is also flexible in that a variety of model topologies and feature sets can be defined to model the sequence-structure relationship without worrying about parameter estimation. Our experimental results demonstrate that, with a simple feature set, CRFSampler generates decoys of much higher quality than the most recent HMM-based model.
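The following is a minimal sketch, not the authors' implementation, of how a linear-chain CRF can be sampled to produce discretized backbone-angle states conditioned on a primary sequence and predicted secondary structure. The angle-state alphabet, feature set (emission and transition weights), and random weights are illustrative assumptions; the actual CRFSampler model uses a far larger, learned parameter set.

```python
import numpy as np

# Illustrative assumptions: 3 secondary-structure classes (H, E, C) and a
# coarse alphabet of discretized (pseudo) backbone-angle states.
ANGLE_STATES = ["A", "B", "G", "E"]          # coarse angle bins (assumed)
SS_CLASSES = {"H": 0, "E": 1, "C": 2}
AA = "ACDEFGHIKLMNPQRSTVWY"

rng = np.random.default_rng(0)
n_states = len(ANGLE_STATES)

# Hypothetical "learned" weights: emissions couple (amino acid, secondary
# structure) to an angle state; transitions couple consecutive angle states.
W_emit = rng.normal(size=(len(AA), len(SS_CLASSES), n_states))
W_trans = rng.normal(size=(n_states, n_states))

def sample_conformation(sequence, ss):
    """Draw one angle-state path from the linear-chain CRF
    p(s | x) ∝ exp(sum_i node[i, s_i] + sum_i W_trans[s_{i-1}, s_i])."""
    L = len(sequence)
    # Node potentials (log-space) from the observed residue and secondary structure.
    node = np.array([W_emit[AA.index(a), SS_CLASSES[s]] for a, s in zip(sequence, ss)])
    # Backward messages: beta[i, s] = log-sum over all suffix paths given state s at i.
    beta = np.zeros((L, n_states))
    for i in range(L - 2, -1, -1):
        scores = W_trans + node[i + 1] + beta[i + 1]       # indexed (s_i, s_{i+1})
        beta[i] = np.logaddexp.reduce(scores, axis=1)
    # Forward sampling with backward messages yields exact samples from p(s | x).
    states, prev = [], None
    for i in range(L):
        logits = node[i] + beta[i]
        if prev is not None:
            logits = logits + W_trans[prev]
        p = np.exp(logits - np.logaddexp.reduce(logits))
        prev = rng.choice(n_states, p=p)
        states.append(ANGLE_STATES[prev])
    return states

print(sample_conformation("MKVLAT", "HHHCCE"))
```

In a full sampler, compactness and self-avoidance constraints would then filter or re-weight the conformations built from these angle states.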
Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors 'at the source.' We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a function of the quality of the data signal provided by sources to the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between individual data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies (frequently called 'backbones') they generate, which are in effect novel classification theories operating at the core of the data-structuring process. The Darwin Core standard for sharing occurrence records plays an under-appreciated role in maintaining this accountability gap, because the standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, potentially leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e., unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.
We use "data aggregation" to refer to merging multiple sets of data of the same kind (e.g., multiple collections of specimens or multiple runs of the same experiment) as distinct from "data integration," which refers to combining multiple kinds of data to solve an inference problem (Berman 2013). The limits of this distinction, where aggregation and integration become hard to tell apart, are an important topic outside the scope of this article.
We introduce a computational method to predict and annotate the catalytic residues of a protein using only its sequence information: we describe both the residues' sequence locations (prediction) and their specific biochemical roles in the catalyzed reaction (annotation). While knowing the chemistry of an enzyme's catalytic residues is essential to understanding its function, prediction and annotation remain difficult, especially when only the enzyme's sequence, and no homologous structure, is available. Our sequence-based approach follows the guiding principle that catalytic residues performing the same biochemical function should have similar chemical environments; it detects specific conservation patterns in the sequence neighborhood of known catalytic residues and accordingly constrains which combinations of amino acids can be present near a predicted catalytic residue. We associate with each catalytic residue a short sequence profile and define a Kullback-Leibler (KL) distance measure between these profiles, which, as we show, effectively captures even subtle biochemical variations. We apply the method to the class of glycohydrolase enzymes, which comprises proteins from 96 families with very different sequences and folds, many of which perform important functions. In a cross-validation test, our approach correctly predicts the location of the enzymes' catalytic residues with a sensitivity of 80% at a specificity of 99.4%, and in a separate cross-validation it also correctly annotates the biochemical role of 80% of the catalytic residues. These results compare favorably to existing methods. Moreover, our method is more broadly applicable because it relies on sequence rather than structure information, and it may also be used in conjunction with structure-based methods.
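Below is a minimal sketch, under assumed profile shapes rather than the authors' actual formulation, of how a symmetrized KL distance between short sequence profiles centered on catalytic residues could be computed. The window size, pseudocount, and toy profiles are illustrative assumptions.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
WINDOW = 5          # assumed: profile spans +/- 2 positions around the residue
PSEUDO = 1e-3       # pseudocount keeps every KL term finite

def column_kl(p, q):
    """KL divergence D(p || q) between two amino-acid distributions."""
    p = (p + PSEUDO) / (p + PSEUDO).sum()
    q = (q + PSEUDO) / (q + PSEUDO).sum()
    return float(np.sum(p * np.log(p / q)))

def profile_distance(P, Q):
    """Symmetrized, position-averaged KL distance between two
    (WINDOW x 20) sequence profiles centered on candidate residues."""
    assert P.shape == Q.shape == (WINDOW, len(AA))
    per_column = [0.5 * (column_kl(p, q) + column_kl(q, p)) for p, q in zip(P, Q)]
    return float(np.mean(per_column))

# Toy usage: compare the profile around a known catalytic residue with the
# profile around a candidate residue; a small distance suggests a similar
# chemical environment and, by the guiding principle above, a similar role.
rng = np.random.default_rng(1)
known = rng.dirichlet(np.ones(len(AA)), size=WINDOW)
candidate = rng.dirichlet(np.ones(len(AA)), size=WINDOW)
print(f"profile distance: {profile_distance(known, candidate):.3f}")
```

In practice the profiles would be estimated from multiple sequence alignments, and the distance threshold for transferring an annotation would be calibrated by cross-validation.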