Large-scale integrated cancer genome characterization efforts, including The Cancer Genome Atlas and the Cancer Cell Line Encyclopedia, have created unprecedented opportunities to study cancer biology in the context of knowing the entire catalog of genetic alterations. A clinically important challenge is to discover cancer subtypes and their molecular drivers in a comprehensive genetic context. Curtis et al. [Nature (2012) 486(7403):346-352] have recently shown that integrative clustering of copy number and gene expression in 2,000 breast tumors reveals novel subgroups beyond the classic expression subtypes that show distinct clinical outcomes. To extend the scope of integrative analysis to include somatic mutation data from massively parallel sequencing, we propose a framework for joint modeling of discrete and continuous variables that arise from integrated genomic, epigenomic, and transcriptomic profiling. The core idea is motivated by the hypothesis that diverse molecular phenotypes can be predicted by a set of orthogonal latent variables that represent distinct molecular drivers, and thus can reveal tumor subgroups of biological and clinical importance. Using the Cancer Cell Line Encyclopedia dataset, we demonstrate that our method can accurately group cell lines by their cell-of-origin for several cancer types, and precisely pinpoint their known and potential cancer driver genes. Our integrative analysis also demonstrates the power to reveal subgroups that are not lineage-dependent, but consist of different cancer types driven by a common genetic alteration. Application to The Cancer Genome Atlas colorectal cancer data reveals distinct integrated tumor subtypes, suggesting different genetic pathways in colon cancer progression.
A major goal of many cancer genome projects is to characterize key genetic alterations in cancer and discover therapeutic targets through comprehensive genomic profiling of the cancer genome.
The Cancer Genome Atlas (TCGA) studies have unveiled the genetic landscape of several cancer types by whole-genome and whole-exome sequencing, DNA copy number profiling, promoter methylation profiling, and mRNA expression profiling in a large number of tumors (1-5). Complementary to the tumor project, the Cancer Cell Line Encyclopedia (CCLE) (6) and the Sanger cell line project (7) have cataloged a compilation of genetic and molecular data in almost 1,000 human cancer cell lines, coupled with pharmacological profiles for a large panel of anticancer drugs. These large-scale integrative genomic efforts have been geared toward comprehensively cataloging individual genomic alterations, analogous to a reverse-engineering process where thousands of individual cancer genomes are taken apart to shed light on common biological principles. Unfortunately, cancer genomes exhibit considerable heterogeneity, with abnormalities occurring in different genes among different individuals, which makes it very challenging to identify the genes with functional importance and therapeutic implications. Thus, there is a...
In many social, economic, biological and medical studies, one objective is to classify a subject into one of several classes based on a set of variables observed from the subject. Because the probability distribution of the variables is usually unknown, the classification rule is constructed using a training sample. The well-known linear discriminant analysis (LDA) works well when the number of variables used for classification is much smaller than the training sample size. Because of advances in technology, modern statistical studies often face classification problems in which the number of variables is much larger than the sample size, and the LDA may perform poorly. We explore when and why the LDA has poor performance and propose a sparse LDA that is asymptotically optimal under some sparsity conditions on the unknown parameters. To illustrate its application, we discuss an example of classifying human cancer into two classes of leukemia based on a set of 7,129 genes and a training sample of size 72. A simulation is also conducted to check the performance of the proposed method.
The rapid increase in the prevalence of chronic heart failure (CHF) worldwide underscores an urgent need to identify biomarkers for the early detection of CHF. Post-translational modifications (PTMs) are associated with many critical signaling events during disease progression and thus offer a plethora of candidate biomarkers. We have employed top-down quantitative proteomics methodology for comprehensive assessment of PTMs in whole proteins extracted from normal and diseased tissues. We have systematically analyzed thirty-six clinical human heart tissue samples and identified phosphorylation of cardiac troponin I (cTnI) as a candidate biomarker for CHF. The relative percentages of the total phosphorylated cTnI forms over the entire cTnI populations (%Ptotal) were 56.4±3.5%, 36.9±1.6%, 6.1±2.4%, and 1.0±0.6% for postmortem hearts with normal cardiac function (n=7), early-stage mild hypertrophy (n=5), severe hypertrophy/dilation (n=4), and end-stage CHF (n=6), respectively. In fresh transplant samples, the %Ptotal of cTnI from non-failing donor hearts (n=4) and end-stage failing hearts (n=10) were 49.5±5.9% and 18.8±2.9%, respectively. Top-down MS with electron capture dissociation unequivocally localized the altered phosphorylation sites to Ser22/23 and determined the order of phosphorylation/dephosphorylation. This study represents the first clinical application of top-down MS-based quantitative proteomics for biomarker discovery from tissues, highlighting the potential of PTMs as disease biomarkers.
We propose a computationally intensive method, the random lasso method, for variable selection in linear models. The method consists of two major steps. In step 1, the lasso method is applied to many bootstrap samples, each using a set of randomly selected covariates. This step yields a measure of importance for each covariate. In step 2, a procedure similar to the first step is implemented, except that for each bootstrap sample a subset of covariates is randomly selected with unequal selection probabilities determined by the covariates’ importance. Adaptive lasso may be used in the second step with weights determined by the importance measures. The final set of covariates and their coefficients are determined by averaging bootstrap results obtained from step 2. The proposed method alleviates some of the limitations of lasso, elastic-net and related methods noted especially in the context of microarray data analysis: it tends to remove highly correlated variables altogether or select them all, and maintains maximal flexibility in estimating their coefficients, particularly with different signs; the number of selected variables is no longer limited by the sample size; and the resulting prediction accuracy is competitive or superior compared to the alternatives. We illustrate the proposed method by extensive simulation studies. The proposed method is also applied to a glioblastoma microarray data analysis.
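The two-step procedure described in this abstract can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the importance measure is assumed to be the absolute value of the bootstrap-averaged coefficient (the abstract does not define it), a single fixed penalty `lam` is used in both steps instead of a tuned one, the plain lasso is used in step 2 rather than the adaptive lasso, and no intercept is fitted. The function names (`lasso_cd`, `weighted_subset`, `one_pass`, `random_lasso`) and the parameters `q1`, `q2`, `B` are hypothetical.

```python
import random


def soft_threshold(z, t):
    """Soft-thresholding operator used in the lasso coordinate update."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0


def lasso_cd(X, y, lam, n_iter=50):
    """Lasso by cyclic coordinate descent (no intercept; X is a list of rows).

    Minimizes (1 / 2n) * ||y - X b||^2 + lam * ||b||_1.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # current residual y - X b
    col_sq = [sum(row[j] ** 2 for row in X) for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # inner product of column j with the partial residual
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * beta[j]
            new = soft_threshold(rho, n * lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def weighted_subset(rng, weights, q):
    """Draw q distinct indices with probability proportional to weights."""
    pool = list(range(len(weights)))
    w = [wt + 1e-12 for wt in weights]  # keeps the total weight positive
    chosen = []
    for _ in range(q):
        k = rng.choices(range(len(pool)), weights=[w[j] for j in pool])[0]
        chosen.append(pool.pop(k))
    return chosen


def one_pass(rng, X, y, q, lam, B, weights):
    """Average lasso coefficients over B bootstrap samples.

    Each fit uses q covariates drawn according to `weights`; covariates
    left out of a fit contribute a coefficient of zero to the average.
    """
    n, p = len(X), len(X[0])
    total = [0.0] * p
    for _ in range(B):
        rows = [rng.randrange(n) for _ in range(n)]  # bootstrap resample
        cols = weighted_subset(rng, weights, q)
        Xb = [[X[i][j] for j in cols] for i in rows]
        yb = [y[i] for i in rows]
        for j, b in zip(cols, lasso_cd(Xb, yb, lam)):
            total[j] += b
    return [t / B for t in total]


def random_lasso(X, y, q1, q2, lam, B=50, seed=0):
    """Two-step random lasso sketch; returns the averaged step-2 coefficients."""
    rng = random.Random(seed)
    p = len(X[0])
    # Step 1: equal selection probabilities for every covariate.
    step1 = one_pass(rng, X, y, q1, lam, B, [1.0] * p)
    # Assumed importance measure: |bootstrap-averaged coefficient|.
    importance = [abs(b) for b in step1]
    # Step 2: selection probabilities proportional to importance.
    return one_pass(rng, X, y, q2, lam, B, importance)
```

Calling `random_lasso(X, y, q1=4, q2=4, lam=0.1)` on a list-of-rows design matrix `X` and response list `y` returns one averaged coefficient per covariate. Because a covariate that is never drawn or always shrunk to zero averages to exactly zero, the number of nonzero coefficients is not bounded by the sample size of any single fit.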
Summary. Variable selection in high-dimensional clustering analysis is an important yet challenging problem. In this article, we propose two methods that simultaneously separate data points into similar clusters and select informative variables that contribute to the clustering. Our methods are in the framework of penalized model-based clustering. Unlike the classical L1-norm penalization, the penalty terms that we propose make use of the fact that parameters belonging to one variable should be treated as a natural "group." Numerical results indicate that the two new methods tend to remove noninformative variables more effectively and provide better clustering results than the L1-norm approach.