“…Large pretrained protein language models, largely relying on the attention-based transformer architecture [Vaswani et al., 2017], have advanced the ability of machine-learning methods to predict protein structure and function from sequence, especially when labeled training data is scarce. Most modern self-supervised protein sequence pretraining combines a transformer model with either an autoregressive likelihood [Madani et al., 2020, 2021, Ferruz et al., 2022, Hesslow et al., 2022] or the masked language modeling (MLM) task introduced for natural language by BERT (bidirectional encoder representations from transformers) [Devlin et al., 2018]. Pretrained transformer protein MLMs contain structural information [Rao et al., 2019, Rives et al., 2021, Chowdhury et al., 2021], encode evolutionary trajectories [Hie et al., 2022a, 2021], are zero-shot predictors of mutation fitness effects [Meier et al., 2021], improve out-of-domain generalization on protein engineering datasets [Dallago et al., 2021], and suggest improved sequences for engineering [Hie et al., 2022b].…”
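
To make the MLM objective referenced above concrete, the following is a minimal sketch, not the pretraining setup of any of the cited works: it assumes a 20-letter amino-acid vocabulary plus a single mask token, a 15% mask rate, and a toy encoder, all of which are illustrative choices rather than details from the source.

    # Minimal BERT-style masked language modeling sketch for protein sequences.
    # Assumptions: 20 canonical amino acids, one <mask> token, 15% mask rate,
    # and a toy encoder; none of these reflect a specific published model.
    import torch
    import torch.nn as nn

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    VOCAB = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    MASK_ID = len(VOCAB)          # index reserved for the <mask> token
    VOCAB_SIZE = len(VOCAB) + 1

    def mlm_loss(encoder: nn.Module, seqs: list, mask_rate: float = 0.15) -> torch.Tensor:
        """Corrupt a fraction of residues and score the encoder's reconstruction."""
        tokens = torch.tensor([[VOCAB[aa] for aa in s] for s in seqs])  # (batch, length)
        targets = tokens.clone()
        mask = torch.rand(tokens.shape) < mask_rate                     # positions to corrupt
        tokens[mask] = MASK_ID
        logits = encoder(tokens)                                        # (batch, length, VOCAB_SIZE)
        # The loss is computed only at the masked positions, as in BERT.
        return nn.functional.cross_entropy(logits[mask], targets[mask])

    # Toy encoder: embedding, one transformer layer, and an output projection.
    toy_encoder = nn.Sequential(
        nn.Embedding(VOCAB_SIZE, 32),
        nn.TransformerEncoderLayer(d_model=32, nhead=4, batch_first=True),
        nn.Linear(32, VOCAB_SIZE),
    )
    loss = mlm_loss(toy_encoder, ["MKTAYIAKQR", "GAVLIPFWMC"])
    loss.backward()

An autoregressive alternative would instead factor the sequence likelihood left to right and train the model to predict each residue from its prefix; the masking step above is what distinguishes the MLM objective.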