Preprint, 2024
DOI: 10.1101/2024.01.29.577794
Aggregating Residue-Level Protein Language Model Embeddings with Optimal Transport

Navid NaderiAlizadeh,
Rohit Singh

Abstract (Motivation): Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into informative embeddings suitable for a range of applications. PLMs, as well as many other protein representation schemes, generate per-token (i.e., per-residue) representations, leading to variable-sized outputs based on protein length. This variability presents a challenge for protein-level prediction tasks, which require uniform-sized embeddings for consistent analysis across different proteins. Pri…
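The variable-length problem the abstract describes can be made concrete with the most common baseline, mean pooling over the residue axis. This is a minimal sketch, not the optimal-transport aggregation the preprint itself proposes; the array shapes (the 1280-dimensional embedding size is typical of ESM-style PLMs) are illustrative assumptions:

```python
import numpy as np

def mean_pool(residue_embeddings: np.ndarray) -> np.ndarray:
    """Collapse a (length, dim) per-residue embedding matrix into a
    fixed-size (dim,) protein-level embedding by averaging over residues."""
    return residue_embeddings.mean(axis=0)

# Two proteins of different lengths map to vectors of the same size,
# as required for protein-level prediction tasks.
rng = np.random.default_rng(0)
short_protein = rng.standard_normal((50, 1280))   # 50 residues
long_protein = rng.standard_normal((400, 1280))   # 400 residues
assert mean_pool(short_protein).shape == (1280,)
assert mean_pool(long_protein).shape == (1280,)
```

Mean pooling discards per-residue ordering and spread, which is precisely the information loss that motivates richer aggregation schemes such as the one studied in this preprint.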

Cited by 5 publications (1 citation statement) | References 50 publications
“…Here, we chose to take the former approach as it explicitly integrates signal across the length of the protein. We note that while this is a commonly used approach, how to best aggregate sequence-length representations into a fixed dimension embedding is an open problem in language modeling [see Naderi Alizadeh and Singh ( 54 ) for one recently proposed approach]. Converting this pooled embedding into a binary ( ) or multiclass ( , 17 possible symmetry classes + “Unknown”) prediction requires an additional classification head.…”
Section: Methods
confidence: 99%
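The "classification head" in the quoted passage can be sketched as a linear layer plus softmax applied to the pooled embedding. This is a hypothetical illustration of the general pattern, not the citing paper's code; the 18-way output corresponds to the 17 symmetry classes plus "Unknown" mentioned in the quote, and the embedding dimension is an assumption:

```python
import numpy as np

def classification_head(pooled: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Map a fixed-size pooled embedding to class probabilities via a
    linear layer followed by a numerically stable softmax."""
    logits = pooled @ W + b
    exps = np.exp(logits - logits.max())
    return exps / exps.sum()

rng = np.random.default_rng(0)
dim, n_classes = 1280, 18      # 17 symmetry classes + "Unknown"
W = rng.standard_normal((dim, n_classes)) * 0.01
b = np.zeros(n_classes)
pooled = rng.standard_normal(dim)  # output of some pooling step

probs = classification_head(pooled, W, b)
assert probs.shape == (18,)
assert abs(probs.sum() - 1.0) < 1e-9
```

A binary head is the same construction with `n_classes = 2` (or a single sigmoid output); in practice the head is trained jointly with, or on top of, frozen PLM embeddings.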