The non-random distribution of alleles of common genomic variants produces haplotypes, which are fundamental in medical and population genomic studies. Consequently, protein-coding genes with different co-occurring sets of alleles can encode different amino acid sequences: protein haplotypes. These protein haplotypes are present in biological samples, and detectable by mass spectrometry, but are not accounted for in proteomic searches. Consequently, the impact of haplotypic variation on the results of proteomic searches, and the discoverability of peptides specific to haplotypes remain unknown. Here, we study how common genetic haplotypes influence the proteomic search space and investigate the possibility to match peptides containing multiple amino acid substitutions to a publicly available data set of mass spectra. We found that for 9.96 % of the discoverable amino acid substitutions encoded by common haplotypes, two or more substitutions may co-occur in the same peptide after tryptic digestion of the protein haplotypes. We identified 342 spectra that matched to such multi-variant peptides, and out of the 4,251 amino acid substitutions identified, 6.63 % were covered by multi-variant peptides. However, the evaluation of the reliability of these matches remains challenging, suggesting that refined error rate estimation procedures are needed for such complex proteomic searches. As these become available and the ability to analyze protein haplotypes increases, we anticipate that proteomics will provide new information on the consequences of common variation, across tissues and time.
Monogenic diabetes is characterized as a group of diseases caused by rare variants in single genes. Multiple genes have been described to be responsible for monogenic diabetes, but the information on the variants is not unified among different resources. We have developed an automated pipeline for collecting and harmonizing data on genetic variants associated with monogenic diabetes. Furthermore, we have translated variant genetic sequences into protein sequences accounting for all protein isoforms and their variants. This allows researchers to consolidate information on variant genes and proteins associated with monogenic diabetes and facilitates their study using proteomics or structural biology. Our open and flexible implementation using Jupyter notebooks enables tailoring and modifying the pipeline and its application to other rare diseases.
Precision medicine focuses on adapting care to the individual profile of patients, e.g. accounting for their unique genetic makeup. Being able to account for the effect of genetic variation on the proteome holds great promises towards this goal. However, identifying the protein products of genetic variation using mass spectrometry has proven very challenging. Here we show that the identification of variant peptides can be improved by the integration of retention time and fragmentation predictors into a unified proteogenomic pipeline. By combining these intrinsic peptide characteristics using the search-engine post-processor Percolator, we demonstrate improved discrimination power between correct and incorrect peptide-spectrum matches. Our results demonstrate that the drop in performance that is induced when expanding a protein sequence database can be compensated, and hence enabling efficient identification of genetic variation products in proteomics data. We anticipate that this enhancement of proteogenomic pipelines can provide a more refined picture of the unique proteome of patients, and thereby contribute to improving patient care.
Monogenic diabetes is characterized as a group of diseases caused by rare variants in single genes. Multiple genes have been described to be responsible for monogenic diabetes, but the information on the variants is not unified among different resources. We have developed an automated pipeline for collecting and harmonizing data on genetic variants associated with monogenic diabetes. Furthermore, we have translated variant genetic sequences into protein sequences accounting for all protein isoforms and their variants. This allows researchers to consolidate information on variant genes and proteins associated with monogenic diabetes and facilitates their study using proteomics or structural biology. Our open and flexible implementation using Jupyter notebooks enables tailoring and modifying the pipeline and its application to other rare diseases.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2025 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.