2023
DOI: 10.48550/arxiv.2301.12068
Preprint

Physics-Inspired Protein Encoder Pre-Training via Siamese Sequence-Structure Diffusion Trajectory Prediction

Abstract: Pre-training methods for proteins have recently been gaining interest, leveraging either protein sequences or structures, while modeling of their joint energy landscape remains largely unexplored. In this work, inspired by the success of denoising diffusion models, we propose the DiffPreT approach to pre-train a protein encoder by sequence-structure multimodal diffusion modeling. DiffPreT guides the encoder to recover the native protein sequences and structures from the perturbed ones along the multimodal diffusion trajectory…
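As described in the abstract, the pre-training objective corrupts both modalities along a shared diffusion trajectory and trains the encoder to recover the native sequence and structure jointly. Below is a minimal sketch of that idea; the encoder, prediction heads, noise schedule, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of sequence-structure multimodal diffusion pre-training.
# All module names and the linear noise schedule are assumptions for clarity.
import torch
import torch.nn.functional as F

def diffusion_pretrain_step(encoder, seq_head, struct_head, seq, coords, num_steps=1000):
    """One pre-training step: perturb sequence and structure at a random
    diffusion timestep, then ask the encoder to recover the native protein."""
    t = torch.randint(1, num_steps + 1, (1,)).item()
    alpha = 1.0 - t / num_steps  # toy linear schedule (assumption)

    # Discrete noise on the sequence: randomly replace a fraction of residues.
    corrupt_mask = torch.rand(seq.shape) > alpha
    noisy_seq = torch.where(corrupt_mask, torch.randint_like(seq, 20), seq)

    # Gaussian noise on the 3D coordinates (structure modality).
    noisy_coords = alpha ** 0.5 * coords + (1 - alpha) ** 0.5 * torch.randn_like(coords)

    # Joint encoding of the perturbed sequence-structure pair.
    h = encoder(noisy_seq, noisy_coords)

    # Recover native residue types and native coordinates from the same representation.
    seq_loss = F.cross_entropy(seq_head(h).transpose(1, 2), seq)    # seq: (B, L) residue ids
    struct_loss = F.mse_loss(struct_head(h), coords)                # coords: (B, L, 3)
    return seq_loss + struct_loss
```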

Cited by 4 publications (4 citation statements) · References 46 publications

Citation statements (ordered by relevance):
“…Although the geometric encoder is able to utilize labeled protein complex structures in ΔΔG_bind datasets, training on a limited set of mutation data could result in overfitting and poor generalization. To address this problem, we further propose a self-supervised pretraining task to exploit large amounts of unlabeled protein structures in PDB [21, 22]. In the pretraining stage, the encoder is trained to model the distribution of these native protein structures via noise contrastive estimation [23].…”
Section: Results (mentioning, confidence: 99%)
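The excerpt above pre-trains a geometric encoder on unlabeled native structures via noise contrastive estimation. A minimal sketch of that idea, assuming a binary native-vs-perturbed discrimination objective and hypothetical module names (not the cited implementation):

```python
# Illustrative noise-contrastive pre-training: the encoder scores native
# structures higher than Gaussian-perturbed "noise" copies.
import torch
import torch.nn.functional as F

def nce_pretrain_step(encoder, energy_head, coords, noise_std=1.0):
    """Binary NCE: classify native vs. noise-perturbed structures."""
    # Generate negatives by perturbing native coordinates with Gaussian noise.
    noisy_coords = coords + noise_std * torch.randn_like(coords)

    # Score both with the shared geometric encoder.
    pos_logit = energy_head(encoder(coords))        # shape (B, 1)
    neg_logit = energy_head(encoder(noisy_coords))  # shape (B, 1)

    logits = torch.cat([pos_logit, neg_logit])
    labels = torch.cat([torch.ones_like(pos_logit), torch.zeros_like(neg_logit)])
    return F.binary_cross_entropy_with_logits(logits, labels)
```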
“…Although GearBind can be trained from scratch on labeled ΔΔG_bind datasets, it could suffer from overfitting or poor generalization if the training data size is limited. To address this problem, we propose a self-supervised pretraining task to exploit large-scale unlabeled protein structures in CATH [22, 23]. In the pretraining stage, the encoder is trained to model the distribution of the native protein structures via noise contrastive estimation [24].…”
Section: Results (mentioning, confidence: 99%)
“…Gligorijević et al (2021); Zhang et al (2022); Xu et al (2022) learn residues from a local part of protein structures. Jing et al (2020); Zhang et al (2023) try to capture atomic structure knowledge in proteins. We develop ms-ESM based on ESM.…”
Section: Related Work (mentioning, confidence: 99%)
“…Graph neural networks (GNNs) have shown remarkable promise in modeling proteins and have been successfully applied to various protein tasks, including fold classification [40, 41], property prediction [40-42], fixed backbone sequence design [27, 43-46], and PSCP [24, 27-29]. With the rationale that the specific side chain conformations are primarily dependent upon the local environment of the amino acid, we decided to model the PSCP problem with a GNN, wherein each residue is modeled as a node and is connected to its k nearest neighbors.…”
Section: Architectural Considerations (mentioning, confidence: 99%)
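The excerpt above represents each residue as a graph node connected to its k spatially nearest neighbors. A minimal sketch of building such a k-NN residue graph from C-alpha coordinates, with hypothetical function and parameter names (the cited works use their own featurization and GNN architectures):

```python
# Illustrative k-nearest-neighbor residue graph construction.
import torch

def knn_residue_graph(ca_coords, k=30):
    """ca_coords: (L, 3) C-alpha coordinates. Returns a (2, L*k) edge index."""
    dist = torch.cdist(ca_coords, ca_coords)        # pairwise distances, shape (L, L)
    dist.fill_diagonal_(float("inf"))               # exclude self-edges
    knn_idx = dist.topk(k, largest=False).indices   # (L, k) nearest neighbors per residue
    src = torch.arange(ca_coords.size(0)).repeat_interleave(k)
    dst = knn_idx.reshape(-1)
    return torch.stack([src, dst])                  # edge_index usable by a GNN
```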