Sequence-to-sequence translation from mass spectra to peptides with a transformer model

Fondrie, William E.; Yilmaz, Melih; Bittremieux, Wout; Nelson, Rowan; Ananth, Varun; Oh, Sewoong; Noble, William Stafford

doi:10.1101/2023.01.03.522621

Cited by 26 publications

(52 citation statements)

References 58 publications


“…Both DeepNovo and PointNovo pass their peak encodings to an LSTM or an output layer to predict the next amino acid. Casanovo frames the problem as a sequence-to-sequence problem [32] and employs a transformer encoder-decoder framework to process and predict sequences of amino acids. PointNovo and Casanovo were both retrained using their respective official GitHub repositories (github.com/volpato30/DeepNovoV2 for PointNovo, and github.com/Noble-Lab/casanovo for Casanovo).…”

Section: Pointnovo and Casanovo Implementations

mentioning

confidence: 99%

De novo peptide sequencing with InstaNovo: Accurate, database-free peptide identification for large scale proteomics experiments

Eloff, Kalogeropoulos, Morell et al. 2023

Preprint


Bottom-up mass spectrometry-based proteomics is challenged by the task of identifying the peptide that generates a tandem mass spectrum. Traditional methods that rely on known peptide sequence databases are limited and may not be applicable in certain contexts. De novo peptide sequencing, which assigns peptide sequences to the spectra without prior information, is valuable for various biological applications; yet, due to a lack of accuracy, it remains challenging to apply this approach in many situations. Here, we introduce InstaNovo, a transformer neural network with the ability to translate fragment ion peaks into the sequence of amino acids that make up the studied peptide(s). The model was trained on 28 million labelled spectra matched to ~742k human peptides from the ProteomeTools project. We demonstrate that InstaNovo outperforms current state-of-the-art methods on benchmark datasets and showcase its utility in several applications. Building upon human intuition, we also introduce InstaNovo+, a multinomial diffusion model that further improves performance by iterative refinement of predicted sequences. Using these models, we could de novo sequence antibody-based therapeutics with unprecedented coverage, discover novel peptides, and detect unreported organisms in different datasets, thereby expanding the scope and detection rate of proteomics searches. Finally, we could experimentally validate tryptic and non-tryptic peptides with targeted proteomics, demonstrating the fidelity of our predictions. Our models unlock a plethora of opportunities across different scientific domains, such as direct protein sequencing, immunopeptidomics, and exploration of the dark proteome.



“…Finally, the SoftMax function is used to convert the array Z_20 to an output array P_20 with values between 0 and 1 (eq 4), which is the probability distribution of the AA category. PSM Scoring Function in CNovo.…”

Section: Experimental Section

mentioning

confidence: 99%
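The softmax step described in the statement above can be illustrated with a minimal sketch. It is not CNovo's code: the function name `softmax`, the `AMINO_ACIDS` ordering, and the example scores are assumptions made for illustration only. It shows how 20 raw scores, one per amino acid, become a probability distribution.

```python
import math

# Hypothetical ordering of the 20 standard amino acids (an assumption,
# not taken from the cited paper).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def softmax(scores):
    """Convert raw scores into probabilities in (0, 1) that sum to 1."""
    highest = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Illustrative scores: every residue scores 0 except lysine (K).
z = [0.0] * len(AMINO_ACIDS)
z[AMINO_ACIDS.index("K")] = 2.0

p = softmax(z)
best = AMINO_ACIDS[max(range(len(p)), key=p.__getitem__)]
```

Here `p` has one entry per amino acid, and `best` is the residue with the highest probability.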

“…Casanovo was recently reported to use a transformer architecture for de novo peptide sequencing with higher accuracy than DeepNovo and Novor. 20 We also compared SpliceNovo with Casanovo. In DS1-JP, DS2-JP, and DS3-JP, the peptide recall rate for the top-1 results of SpliceNovo was 1.9, −2.8, and −0.5% higher than that of Casanovo (Figure S12a−c).…”

Section: Analytical

mentioning

confidence: 99%
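The top-1 peptide recall mentioned in the statement above can be sketched as follows, assuming it means the fraction of spectra whose highest-ranked prediction matches the reference peptide exactly. The function name `peptide_recall` and the choice to treat isoleucine and leucine as equivalent (they have the same mass) are assumptions for illustration; the cited paper may define the metric differently.

```python
def peptide_recall(predicted, reference):
    """Fraction of reference peptides whose top-1 prediction matches.

    `predicted` holds one top-1 sequence per spectrum, or None when the
    tool made no prediction. I and L are treated as the same residue
    (an assumption of this sketch).
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have equal length")
    if not reference:
        return 0.0

    def normalize(seq):
        return seq.replace("I", "L")

    hits = sum(
        1
        for pred, ref in zip(predicted, reference)
        if pred is not None and normalize(pred) == normalize(ref)
    )
    return hits / len(reference)


# Made-up sequences, for illustration only.
recall = peptide_recall(["PEPTIDE", None, "LLK"], ["PEPTIDE", "AAK", "ILK"])
```

In this example two of the three spectra count as correct, the second because "LLK" and "ILK" differ only by I versus L.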

Novel Proteoform Discovery by Precise Semi-De Novo Sequencing of Novel Junction Peptides

Wong 2023

Anal. Chem.


Alternative splicing allows a small number of human genes to encode large amounts of proteoforms that play essential roles in normal and disease physiology. Some low-abundance proteoforms may remain undiscovered due to limited detection and analysis capabilities. Peptides coencoded by novel exons and annotated exons separated by introns are called novel junction peptides, which are the key to identifying novel proteoforms. Traditional de novo sequencing does not take into account the specificity in the composition of the novel junction peptide and is therefore not as accurate. We first developed a novel de novo sequencing algorithm, CNovo, which outperformed the mainstream PEAKS and Novor in all six test sets. We then built on CNovo to develop a semi-de novo sequencing algorithm, SpliceNovo, specifically for identifying novel junction peptides. SpliceNovo identifies junction peptides with much higher accuracy than CNovo, CJunction, PEAKS, and Novor. Of course, it is also possible to replace the built-in CNovo in SpliceNovo with other more accurate de novo sequencing algorithms to further improve its performance. We also successfully identified and validated two novel proteoforms of the human EIF4G1 and ELAVL1 genes by SpliceNovo. Our results significantly improve the ability to discover novel proteoforms through de novo sequencing.


“…1 Over these decades, numerous algorithmic advances have steadily improved MS data interpretation for sequence identification. 2 The pace of this progress continues unabated along many avenues, including improved interpretation of complex spectra from multiplexed data independent acquisition (plexDIA), 3 de novo sequencing with new embeddings and transformer neural network architectures, 4 improvements in open searches and the identification of peptide modifications, 5 and improved models of isotopic compositions. The latter advances are exemplified by a new approach termed Conditional fragment Ion Distribution Search (CIDS). 6 CIDS can substantially increase sequence identification rates for peptides labeled by using heavy water (²H) or ¹⁵N since such peptides have structural isomers and the distributions of their fragment ions have been difficult to predict.…”

mentioning

confidence: 99%


Great Gains in Mass Spectrometry Data Interpretation

Slavov 2023

J. Proteome Res.
