Background: Despite the immense importance of transmembrane proteins (TMPs) for molecular biology and medicine, experimental 3D structures for TMPs remain about 4-5 times underrepresented compared to non-TMPs. Today's top methods can accurately predict structures for many TMPs, but the annotation of transmembrane regions remains a limiting step for proteome-wide predictions.
Results: Here, we present a novel method, dubbed TMbed. Inputting embeddings from protein Language Models (in particular ProtT5), TMbed completes predictions of alpha-helical and beta-barrel TMPs for entire proteomes within hours on a single consumer-grade desktop machine, at performance levels similar to or better than those of methods that use evolutionary information (extracted from family alignments). On the per-protein level, TMbed correctly identified 61 of the 65 beta-barrel TMPs (94±7%) and 579 of the 593 alpha-helical TMPs (98±1%) in a non-redundant data set, at false positive rates well below 1% (erring on 31 of 5859 non-membrane proteins). On the per-segment level, TMbed correctly placed, on average, 9 of 10 transmembrane segments within five residues of the experimental observation. Although limited by GPU memory, our method can handle sequences of up to 4200 residues on standard graphics cards used in common desktop PCs (e.g., NVIDIA GeForce RTX 3060).
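As context for the embedding input described above: ProtT5 models expect sequences with space-separated residues and the rare amino acids U, Z, O, and B mapped to the unknown token X before tokenization. The sketch below illustrates only this standard preprocessing convention; the helper name `preprocess_sequence` is our own illustration, not part of TMbed's API.

```python
import re

def preprocess_sequence(seq: str) -> str:
    """Uppercase, map rare amino acids (U, Z, O, B) to X,
    and space-separate residues for the ProtT5 tokenizer."""
    seq = seq.upper()
    seq = re.sub(r"[UZOB]", "X", seq)  # rare residues -> unknown token X
    return " ".join(seq)

# Example: a short fragment containing a selenocysteine (U)
print(preprocess_sequence("MKTUAQ"))  # -> "M K T X A Q"
```

The resulting string is what a ProtT5 tokenizer would consume to produce the per-residue embeddings that serve as TMbed's input.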
Conclusions: TMbed accurately predicts alpha-helical and beta-barrel TMPs. Utilizing protein Language Models and GPU acceleration, it can predict the entire human proteome in less than an hour.
Availability: Our code, method, and data sets are freely available in the GitHub repository: https://github.com/BernhoferM/TMbed.