Background: Recently, there has been great interest in applying information extraction methods to the biomedical domain, in particular to extracting relationships among genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora.
In this paper, we introduce a framework for ranking cost functions of the regularized least-squares (RLS) type, and we propose three such cost functions. Further, we propose a kernel-based preference learning algorithm, which we call RankRLS, for minimizing these functions. We show that RankRLS has many computational advantages over ranking algorithms that minimize other types of costs, such as the hinge cost. In particular, we present efficient algorithms for training, parameter selection, multiple output learning, cross-validation, and large-scale learning. We also consider the circumstances under which these computational benefits make RankRLS preferable to RankSVM. We evaluate RankRLS on four different types of ranking tasks, using RankSVM and standard RLS regression as baselines. RankRLS outperforms standard RLS regression and performs very similarly to RankSVM, while offering several computational benefits over RankSVM.
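To make the idea of a least-squares ranking cost concrete, here is a minimal sketch in Python. The abstract does not spell out the three cost functions, so this assumes one plausible instance: the squared error between true and predicted score differences, summed over all pairs of training examples, plus a kernel-norm regularizer. Under that assumption the cost can be written with the Laplacian matrix L = mI - 11^T of the complete preference graph, and setting the gradient to zero gives the linear system (LK + λI)a = Ly. The function names and the linear kernel are illustrative choices, not taken from the paper.

```python
import numpy as np


def linear_kernel(A, B):
    """Linear kernel matrix between the rows of A and the rows of B."""
    return A @ B.T


def train_pairwise_rls(K, y, lam):
    """Minimize sum_{i<j} ((y_i - y_j) - (f_i - f_j))^2 + lam * a^T K a,
    where f = K a. This all-pairs cost is an assumption of this sketch.

    The pairwise sum equals (y - K a)^T L (y - K a) with
    L = m*I - 1 1^T, so a stationary point solves (L K + lam I) a = L y.
    """
    m = K.shape[0]
    L = m * np.eye(m) - np.ones((m, m))
    return np.linalg.solve(L @ K + lam * np.eye(m), L @ y)


def predict_scores(K_test_train, a):
    """Ranking scores for test points; only their order matters."""
    return K_test_train @ a


def pairwise_accuracy(y_true, scores):
    """Fraction of non-tied pairs that the scores order correctly."""
    correct = 0
    total = 0
    n = len(y_true)
    for i in range(n):
        for j in range(i + 1, n):
            if y_true[i] == y_true[j]:
                continue  # ties carry no preference
            total += 1
            if (y_true[i] - y_true[j]) * (scores[i] - scores[j]) > 0:
                correct += 1
    return correct / total if total else 0.0
```

Because training reduces to a single linear solve, this sketch hints at why a least-squares ranking cost can be computationally convenient; the efficient procedures for parameter selection and cross-validation mentioned in the abstract are not reproduced here.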
Reliable structural and statistical analyses of three-dimensional protein structures should be based on unbiased data. The Protein Data Bank is highly redundant, containing several entries for identical or very similar sequences. A technique was developed for clustering the known structures based on their sequences and their contents of alpha- and beta-structures. First, sequences were aligned pairwise. A representative sample of sequences was then obtained by grouping similar sequences together and selecting a typical representative from each group. The similarity significance threshold needed in the clustering method was found by analyzing similarities of random sequences. Because the three-dimensional structures of proteins in the same structural class are generally more conserved than their sequences, the proteins were also clustered according to their contents of secondary structural elements. The results of these clusterings indicate conservation of alpha- and beta-structures even when sequence similarity is relatively low. An unbiased sample of 103 high-resolution structures, representing a wide variety of proteins, was chosen based on the suggestions made by the clustering algorithm. The proteins were divided into structural classes according to their contents and ratios of secondary structural elements. Previous classifications have suffered from a subjective view of secondary structures, whereas here the classification was based on backbone geometry. This concise view led to the reclassification of some structures. The representative set of structures facilitates unbiased analyses of relationships between protein sequence, function, and structure, as well as of structural characteristics.
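The grouping step described above can be sketched as follows. The abstract does not say which clustering procedure or representative-selection rule was used, so this example makes two assumptions: sequences are linked whenever their pairwise similarity reaches a threshold (single linkage, implemented with union-find), and the representative of a group is the member with the highest total similarity to the other members. The similarity matrix and the `threshold` value are hypothetical inputs; in the paper the threshold was derived from similarities of random sequences.

```python
def cluster_by_similarity(sim, threshold):
    """Group items whose pairwise similarity is >= threshold.

    `sim` is a symmetric matrix (list of lists) of similarity scores.
    Single linkage is an assumption of this sketch.
    """
    n = len(sim)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if sim[i][j] >= threshold:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[rj] = ri

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def pick_representative(group, sim):
    """Return the member most similar, in total, to the rest of its group.

    Ties go to the member listed first in `group`.
    """
    return max(group, key=lambda i: sum(sim[i][j] for j in group if j != i))
```

Applying `pick_representative` to each group returned by `cluster_by_similarity` yields one index per group, which corresponds to the non-redundant sample the abstract describes.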