Simplified and unified access to cancer proteogenomic data

Lindgren, Caleb M.; Adams, David W.; Kimball, Benjamin; Boekweg, Hannah; Tayler, Sadie; Pugh, Samuel L.; Payne, Samuel H.

doi:10.1101/2020.11.16.385427

Cited by 7 publications

(8 citation statements)

References 23 publications

(33 reference statements)


“…CPTAC RNA-seq and mass spectrometry datasets for breast (Krug et al, 2020), ovarian (Hu et al, 2020b;Zhang et al, 2016), colorectal (Vasaikar et al, 2019;Zhang et al, 2014), lung adenocarcinoma (Gillette et al, 2020), and endometrial (Dou et al, 2020) cancer discovery studies were retrieved in accordance with the CPTAC data use and embargo policies using the cptac v.0.9.1 package in Python 3.9. Statistical learning was performed using scikit-learn 0.24.2 (Lindgren et al, 2021). Transcriptomics data were standardized, after which data were split 80/20 into train and test sets.…”

Section: Retrieval and Analysis Of Public Expression Data Sets (mentioning)

confidence: 99%

Transcriptome features of striated muscle aging and predictability of protein level changes

Kastury

et al. 2021

Mol. Omics
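The preprocessing described in the statement above (standardize the transcriptomics data, then split 80/20 into train and test sets) can be sketched with scikit-learn. This is a minimal illustration on synthetic data, not the citing authors' code: the array shapes, random seed, and variable names are assumptions, and no cptac calls are shown because the statement does not say which ones were used.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a samples x genes transcriptomics table (assumed shape).
X = rng.normal(loc=5.0, scale=2.0, size=(100, 20))
# Synthetic stand-in for the quantity being predicted.
y = rng.normal(size=100)

# Standardize each gene (column) to zero mean and unit variance,
# following the order given in the statement: standardize, then split.
X_std = StandardScaler().fit_transform(X)

# 80/20 train/test split with a fixed seed for reproducibility.
X_train, X_test, y_train, y_test = train_test_split(
    X_std, y, test_size=0.2, random_state=0
)

print(X_train.shape, X_test.shape)
```

Fitting the scaler on the training split only is a common alternative that avoids leaking test-set statistics; the order used here simply mirrors the quoted description.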


“…We performed the analysis using expression data from TCGA/CPTAC RNA sequencing experiments through the cptac Python API, which retrieves the final data tables from the flagship CPTAC papers of each individual cancer type 25. Although each TCGA/CPTAC cancer subtype project follows an overall consistent experimental design and data acquisition strategy, minute differences exist in the processing pipelines used to analyze the RNA sequencing data (e.g., STAR vs. Bowtie2) and gene expression measure (e.g., RPKM vs. FPKM) which could bias gene expression values across cancer types.…”

Section: Discussion (mentioning)

confidence: 99%

“…The cumulative inclusions of each cancer type in the order above are sequentially referred to as CPTAC_2 to CPTAC_8 in the manuscript, such that CPTAC_2 refers to the union of ovarian and breast cancer (OV + BR); CPTAC_3 refers the union of ovarian, breast, and endometrial cancer (OV + BR + EN); and so on. The mRNA and protein level expression data from the CPTAC cancer types was retrieved using the cptac package v.0.9.7 25 in Python 3.9. Each column of the quantitative measurement of the transcriptomics data acted as an independent variable or feature variable whereas the normalized quantitative measurement of a particular protein of interest acted as the single dependent or target variable in the protein model.…”

Section: Methods (mentioning)

confidence: 99%

Protein prediction models support widespread post-transcriptional regulation of protein abundance by interacting partners

Srivastava

Lau

2022

Preprint


Protein and mRNA levels correlate only moderately across samples. The availability of proteogenomics data sets with protein and transcript measurements from matching samples is providing new opportunities to assess the degree to which protein levels in a system can be predicted from mRNA information. Here we revisited the protein level prediction problem and analyzed large proteogenomics data from 8 cancer types within the Clinical Proteomic Tumor Analysis Consortium (CPTAC) data set. We trained models to predict the abundance of 13,637 proteins using matching transcriptome data from up to 958 tumor or normal adjacent tissue samples each, and compared predictive performances across algorithms and input features. As data depth increased, the incorporation of additional transcripts as input features increasingly improved the ability to predict sample protein abundance. We found widespread occurrences where the abundance of a protein is considerably less well explained by its own cognate transcript level than that of one or more trans locus transcripts, suggesting post-transcriptional regulations broadly influence protein and mRNA correlation. Transcripts that contribute to non-cognate protein abundance primarily involve those encoding known interaction partners and protein complex members of the protein of interest. Network analysis reveals a proteome-wide interdependency of protein abundance on the transcript levels of interacting proteins. Proteogenomic co-expression analysis may have utility for finding gene interactions and predicting expression changes in biological systems.
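The comparison described in the abstract above, where a protein's abundance is predicted either from its own cognate transcript or from many transcripts, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the data are synthetic, the effect sizes and the indices `cognate` and `partner` are hypothetical, and plain linear regression stands in for whichever algorithms the authors compared.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic samples x transcripts matrix (assumed sizes).
n_samples, n_transcripts = 200, 10
X = rng.normal(size=(n_samples, n_transcripts))

# Hypothetical setup: the protein depends weakly on its cognate transcript
# and strongly on the transcript of an interaction partner.
cognate, partner = 0, 3
protein = (
    0.3 * X[:, cognate]
    + 1.0 * X[:, partner]
    + rng.normal(scale=0.5, size=n_samples)
)

X_train, X_test, y_train, y_test = train_test_split(
    X, protein, test_size=0.2, random_state=0
)

# Model 1: cognate transcript as the only feature.
self_model = LinearRegression().fit(X_train[:, [cognate]], y_train)
r2_self = self_model.score(X_test[:, [cognate]], y_test)

# Model 2: every transcript as a feature.
full_model = LinearRegression().fit(X_train, y_train)
r2_full = full_model.score(X_test, y_test)

# The feature with the largest absolute coefficient in the full model.
top_feature = int(np.argmax(np.abs(full_model.coef_)))

print(f"R^2, cognate only: {r2_self:.2f}")
print(f"R^2, all transcripts: {r2_full:.2f}")
print(f"Most influential transcript index: {top_feature}")
```

In this toy setup the all-transcript model explains more held-out variance because the signal was placed in a non-cognate transcript by construction; it shows the shape of the comparison only and says nothing about the real CPTAC results.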


“…For studies (Clark et al, 2019; Dou et al, 2020a; Gillette et al, 2020; Huang et al, 2021; Krug et al, 2020; Wang et al, 2021) both the transcriptomic and proteomic profiles were obtained from the CPTAC API (Lindgren et al, 2021). For colorectal (Zhang et al, 2014) and breast cancer (Mertins et al, 2016) studies, the transcriptomic data were downloaded from cBioPortal while proteomic data was obtained from the supplemental materials.…”

Section: Methods (mentioning)

confidence: 99%

Experimental reproducibility limits the correlation between mRNA and protein abundances in tumour proteomic profiles

Upadhya

Ryan

2021

Preprint


Large-scale studies of human proteomes have revealed only a moderate correlation between mRNA and protein abundances. It is unclear to what extent this moderate correlation reflects post-transcriptional regulation and to what extent it reflects measurement error. Here, by analysing replicate proteomic profiles of tumour samples, we show that there is considerable variation in the reproducibility of measurements of individual proteins. We show that proteins with more reproducible measurements tend to have higher mRNA-protein correlation, suggesting that a substantial fraction of the unexplained variation between mRNA and protein abundances may be attributed to limitations in the reproducibility of proteomic quantification. We find that proteins that have high reproducibility in one study tend to have high reproducibility in others and exploit this to develop an 'aggregate protein reproducibility' score. This score can explain a substantial amount of the variation in mRNA-protein correlation across multiple studies of both healthy and tumour samples.
