Proteogenomics suffers from statistical issues because the sequencing-derived information inflates the database size and, with it, the search space. To compensate for this, rescoring with the machine learning-based spectrum predictors MS²PIP and Prosit was implemented in a proteogenomics approach and demonstrated for databases derived from both ribosome profiling and nanopore RNA-Seq. Postprocessing with Percolator showed that these techniques recover, and often even elevate, stringency levels and identification rates. The approach thereby allows novel proteoforms to be validated through proteogenomics with unsurpassed confidence levels.
Highlights
• First proteogenomics with PSM rescoring using machine learning-predicted spectra
• Demonstrated on both ribosome profiling and nanopore RNA-Seq-derived databases
• Rescoring leads to elevated stringency and increased identification rates
• Rescoring compensates for the search space size issues in proteogenomics