Of the currently identified protein sequences, 99.6% have never been observed in the laboratory as proteins, and their molecular function has not been established experimentally. Predicting the function of such proteins relies mostly on annotated homologs. However, this has resulted in some erroneous annotations, and many proteins have no annotated homologs. Here we propose a de novo function prediction approach based on identifying biophysical features that underlie function. Using our approach, we discover DNA- and RNA-binding proteins that cannot be identified based on homology and validate these predictions experimentally. For example, FGF14, which belongs to a family of secreted growth factors, was predicted to bind DNA. We verify this experimentally and also show that FGF14 is localized to the nucleus. Mutating the predicted binding site on FGF14 abrogated DNA binding. These results demonstrate the feasibility of automated de novo function prediction based on identifying function-related biophysical features.
Diffuse large B-cell lymphoma (DLBCL) is a complex and aggressive malignancy. The standard-of-care chemo-immunotherapeutic regimen, R-CHOP (rituximab, cyclophosphamide, doxorubicin, vincristine, and prednisone), leads to a complete response in most patients. Unfortunately, 30-40% of patients are either refractory to this regimen or relapse after a complete response, and these patients have a dismal prognosis. Gene expression profiling has delineated two distinct molecular subtypes of DLBCL, the germinal center B-cell-like (GCB) subtype and the activated B-cell-like (ABC) subtype; 10 to 15% of cases are unclassifiable. These groups differ in survival and could potentially direct therapy (Alizadeh, Nature 2000). Schmitz et al. used whole-exome and transcriptome data from 574 DLBCL tumors to refine these genetic subtypes in an attempt to improve prognostic capacity, focusing on protein-coding genes (Schmitz, N Engl J Med. 2018). Non-coding RNAs (ncRNAs) have been shown to be differentially expressed and to cluster across different groups of DLBCL samples, suggesting that these ncRNAs may also participate in DLBCL pathogenesis (Shi, OncoTargets and therapy 2020). In this study, we applied machine learning methods to classify DLBCL genetic subgroups based on ncRNA expression and propose a clinical-genetic survival predictive model. Out of 1866 ncRNAs from the study by Schmitz et al., 377 were selected using the information gain algorithm and were used to classify 234 DLBCL tumors into the genetic subgroups (ABC, GCB, Unclassified). Classification models were trained using K-nearest neighbors (KNN; K=5), decision tree, random forest, and multilayer perceptron algorithms, yielding weighted areas under the ROC curve of 0.895, 0.749, 0.924, and 0.965, respectively.
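The feature-selection step can be illustrated with a minimal sketch of information gain scoring. This is not the authors' pipeline: the abstract does not say how expression values were discretized, so the median split used here is an assumption, and the ncRNA names and expression values are made up for illustration.

```python
import math
from collections import Counter
from statistics import median


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Information gain of one feature after a median split.

    The median split is an assumed discretization, not taken from the abstract.
    """
    cutoff = median(values)
    high = [lab for v, lab in zip(values, labels) if v > cutoff]
    low = [lab for v, lab in zip(values, labels) if v <= cutoff]
    conditional = sum(
        len(part) / len(labels) * entropy(part) for part in (high, low) if part
    )
    return entropy(labels) - conditional


def select_features(expression, labels, threshold=0.0):
    """Return (name, gain) pairs with gain above threshold, highest gain first."""
    gains = {name: information_gain(vals, labels) for name, vals in expression.items()}
    kept = [(name, gain) for name, gain in gains.items() if gain > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Toy cohort: hypothetical ncRNA names and invented expression values.
labels = ["ABC", "ABC", "ABC", "ABC", "GCB", "GCB", "GCB", "GCB"]
expression = {
    # Separates the two subtypes perfectly at the median.
    "ncRNA_informative": [9.1, 8.7, 9.4, 8.9, 2.0, 2.3, 1.8, 2.1],
    # High and low halves each contain both subtypes equally.
    "ncRNA_uninformative": [5.0, 5.0, 1.0, 1.0, 5.0, 5.0, 1.0, 1.0],
}

selected = select_features(expression, labels)
print(selected)
```

In this toy cohort only the feature that separates the subtypes at its median is retained. The retained features would then be passed to the classifiers named in the abstract (KNN, decision tree, random forest, multilayer perceptron).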
Using the information gain algorithm, we identified 28 ncRNAs with an information gain score >0 for classifying patients by whether they achieved three-year survival. Of these, seven ncRNAs correlated significantly with overall survival (OS) (p<0.05 for each) in Cox regression survival analysis. In multivariate analysis including age, gender, ECOG, IPI, genetic subgroups, and these seven ncRNAs, only age and three ncRNAs (NR_026893, NR_002939, NR_002186) were significantly associated with OS (Figure 1A). We performed Kaplan-Meier analysis using these three ncRNAs as binary variables (medians were used as cutoffs), dividing the cohort into three groups (all three ncRNAs upregulated, one or two downregulated, and all three downregulated) with a robust difference in OS (median OS not reached, 5.5 years (95% CI 1.1-9.9), and 1.5 years (95% CI 0.4-2.7), respectively), as presented in Figure 1B. In conclusion, we detected novel diagnostic and prognostic ncRNA biomarkers that could potentially inform clinical management for patients with DLBCL. Further study of ncRNA expression profiles and cellular mechanisms could improve our understanding of the disease, help identify new therapeutic targets, and support the development of new therapies.
Disclosures Avivi: Kite, a Gilead Company: Speakers Bureau; Novartis: Speakers Bureau.
Cohen: Karyopharm: Membership on an entity's Board of Directors or advisory committees, Research Funding; GSK: Consultancy, Membership on an entity's Board of Directors or advisory committees; Amgen: Membership on an entity's Board of Directors or advisory committees, Research Funding; Janssen: Membership on an entity's Board of Directors or advisory committees; Takeda: Membership on an entity's Board of Directors or advisory committees, Research Funding; Neopharm / promedico: Consultancy, Membership on an entity's Board of Directors or advisory committees, Research Funding.
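The median-cutoff grouping and Kaplan-Meier analysis described in the abstract can be sketched as follows. This is a minimal illustration in plain Python, not the authors' analysis code: the marker names, expression values, follow-up times, and event flags are invented, and confidence intervals and Cox regression are omitted.

```python
from statistics import median


def kaplan_meier(times, events):
    """Return [(time, survival_probability)] at each distinct event time.

    times  -- follow-up time per patient
    events -- 1 if death was observed at that time, 0 if censored
    """
    survival = 1.0
    curve = []
    for t in sorted({t for t, e in zip(times, events) if e == 1}):
        at_risk = sum(1 for u in times if u >= t)
        deaths = sum(1 for u, e in zip(times, events) if u == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve


def median_survival(curve):
    """First time the curve drops to 0.5 or below; None if not reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None


def count_upregulated(expression):
    """Per patient, count markers above that marker's cohort median."""
    cutoffs = {name: median(vals) for name, vals in expression.items()}
    n_patients = len(next(iter(expression.values())))
    return [
        sum(expression[name][i] > cutoffs[name] for name in expression)
        for i in range(n_patients)
    ]


def risk_group(n_up, n_markers=3):
    """Map the number of upregulated markers to one of three groups."""
    if n_up == n_markers:
        return "all up"
    if n_up == 0:
        return "all down"
    return "mixed"


# Toy data: hypothetical marker names and invented values.
expression = {
    "ncRNA_1": [1.0, 2.0, 3.0, 4.0],
    "ncRNA_2": [1.0, 2.0, 3.0, 4.0],
    "ncRNA_3": [4.0, 1.0, 2.0, 3.0],
}
groups = [risk_group(n) for n in count_upregulated(expression)]

# Toy follow-up (years) with one censored patient (event flag 0).
curve = kaplan_meier([1, 2, 3, 4], [1, 0, 1, 1])
print(groups, curve, median_survival(curve))
```

In practice each group's curve would be estimated separately from that group's follow-up data. A median of `None` here corresponds to "not reached" in the abstract.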