2022
DOI: 10.48550/arxiv.2205.15675
Preprint

Contrastive Representation Learning for 3D Protein Structures

Abstract: Learning from 3D protein structures has gained wide interest in protein modeling and structural bioinformatics. Unfortunately, the number of available structures is orders of magnitude lower than the training data sizes commonly used in computer vision and machine learning. Moreover, this number is reduced even further when only annotated protein structures can be considered, making the training of existing models difficult and prone to over-fitting. To address this challenge, we introduce a new representatio…
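The abstract is cut off above, so the paper's exact training objective is not reproduced here. Purely as an illustration of contrastive representation learning over structure embeddings, the sketch below implements a generic InfoNCE/NT-Xent loss in PyTorch; the encoder, the choice of augmented views, and the temperature are assumptions made for the example, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """Generic NT-Xent / InfoNCE loss between two batches of embeddings.

    z1, z2: (batch, dim) embeddings of two views of the same proteins
    (e.g. perturbed or sub-sampled structures); row i of z1 and row i of
    z2 form a positive pair, all other rows serve as negatives.
    """
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature              # pairwise cosine similarities
    targets = torch.arange(z1.size(0), device=z1.device)
    # Symmetrize over both view orderings.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Hypothetical usage: embed two augmented views of each structure with any
# 3D encoder and minimize the loss so that matching views attract.
# loss = info_nce_loss(encoder(view_a), encoder(view_b))
```

In a setup of this kind, the pre-trained encoder can then be fine-tuned on the comparatively small set of annotated structures, which is the data-scarcity problem the abstract describes.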

Cited by 12 publications (16 citation statements)
References 44 publications
“…As discussed before, learning on 3D structures cannot benefit from these large amounts of sequential data. As a consequence, the model sizes of those GGNNs are limited, or overfitting may occur [39]. In contrast, a comparison of the number of protein sequences in the UniProt database [19] with the number of known structures in the PDB shows over 1700 times more sequences than structures.…”
Section: Methods
confidence: 99%
“…The pre-trained PLMs have achieved impressive performance on a variety of downstream tasks for structure and function prediction (Rao et al., 2019; Xu et al., 2022c). Recent works have also studied pre-training on unlabeled protein structures for generalizable representations, covering contrastive learning (Zhang et al., 2022; Hermosilla & Ropinski, 2022), self-prediction of geometric quantities (Zhang et al., 2022; Chen et al., 2022), and denoising score matching (Guo et al., 2022; Wu et al., 2022a).…”
Section: Related Work
confidence: 99%
“…Among these methods, various pre-training approaches (Elnaggar et al., 2021; Rives et al., 2021; Zhang et al., 2022) succeed in learning effective protein representations from the large amount of available protein sequences or from their experimental/predicted structures. Sequence-based pre-training methods (Elnaggar et al., 2021; Rives et al., 2021) can effectively acquire co-evolutionary information, and structure-based pre-training methods (Zhang et al., 2022; Hermosilla & Ropinski, 2022) are able to capture protein structural characteristics sufficient for tasks like function prediction and fold classification. These two types of information are both useful for indicating underlying protein functions, and they are complementary to each other.…”
Section: Introduction
confidence: 99%
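The statement above argues that sequence-derived and structure-derived representations carry complementary information. As a loose illustration only, and not the architecture of any cited work, the following sketch fuses a per-protein sequence embedding and a per-protein structure embedding by concatenation followed by a small MLP; all dimensions and layer sizes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class SeqStructFusion(nn.Module):
    """Concatenate sequence and structure embeddings and project them
    to a joint representation for a downstream task head."""

    def __init__(self, seq_dim: int, struct_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(seq_dim + struct_dim, out_dim),
            nn.ReLU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, seq_emb: torch.Tensor, struct_emb: torch.Tensor) -> torch.Tensor:
        # seq_emb: (batch, seq_dim), e.g. from a pre-trained protein language model;
        # struct_emb: (batch, struct_dim), e.g. from a pre-trained 3D structure encoder.
        return self.proj(torch.cat([seq_emb, struct_emb], dim=-1))
```

A task-specific head (for example, a classifier for function prediction or fold classification) would then operate on the fused vector.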