Therapeutic antibodies are an important and rapidly growing drug modality. However, the design and discovery of early-stage antibody therapeutics remain a time and cost-intensive endeavor. Here we present an end-to-end Bayesian, language model-based method for designing large and diverse libraries of high-affinity single-chain variable fragments (scFvs) that are then empirically measured. In a head-to-head comparison with a directed evolution approach, we show that the best scFv generated from our method represents a 28.7-fold improvement in binding over the best scFv from the directed evolution. Additionally, 99% of designed scFvs in our most successful library are improvements over the initial candidate scFv. By comparing a library’s predicted success to actual measurements, we demonstrate our method’s ability to explore tradeoffs between library success and diversity. Results of our work highlight the significant impact machine learning models can have on scFv development. We expect our method to be broadly applicable and provide value to other protein engineering tasks.
Therapeutic antibody development has become an increasingly popular approach for drug development. To date, antibody therapeutics are largely developed using large scale experimental screens of antibody libraries containing hundreds of millions of antibody sequences. The high cost and difficulty of developing therapeutic antibodies create a pressing need for computational methods to predict antibody properties and create bespoke designs. However, the relationship between antibody sequence and activity is a complex physical process and traditional iterative design approaches rely on large scale assays and random mutagenesis. Deep learning methods have emerged as a promising way to learn antibody property predictors, but predicting antibody properties and target-specific activities depends critically on the choice of antibody representations and data linking sequences to properties is often limited. Recently, methods for learning biological sequence representations from large, general purpose protein datasets have demonstrated impressive ability to capture structural and functional properties of proteins. However, existing works have not yet investigated the value, limitations and opportunities of these methods in application to antibody-based drug discovery. In this paper, we present results on a novel SARS-CoV-2 antibody binding dataset and an additional benchmark dataset. We compare three classes of models: conventional statistical sequence models, supervised learning on each dataset independently, and fine-tuning an antibody specific pre-trained embedding model. The pre-trained models were trained on tens of millions of natural healthy human antibody sequences and protein sequences, respectively. Experimental results suggest that self-supervised pretraining of feature representation consistently offers significant im-DISTRIBUTION STATEMENT A. Approved for public release. Distribution is unlimited.
Therapeutic antibodies are an important and rapidly growing drug modality. However, the design and discovery of early-stage antibody therapeutics remain a time and cost-intensive endeavor. In this work, we present an end-to-end Bayesian, language model-based method for designing large and diverse libraries of high-affinity single-chain variable fragments (scFvs). We integrate target-specific binding affinities with information from millions of natural protein sequences in a probabilistic machine learning framework to design thousands of scFvs that are then empirically measured. In a head-to-head comparison with a directed evolution approach, we show that the best scFv generated from our method represents a 28.8-fold improvement in binding over the best scFv from the directed evolution. Additionally, 99% of the designed scFvs in our most successful library are improvements over the initial candidate scFv. By comparing a library's predicted success to actual measurements, we demonstrate our method's ability to explore tradeoffs between library success and diversity during the design phase and prior to experimental testing. The results of our work highlight the significant impact machine learning models can have on scFv development. We expect our end-to-end method to be broadly applicable and able to provide value to other protein engineering tasks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.