The dataset presented here contains quantitative binding scores of scFv-format antibodies against a SARS-CoV-2 target peptide collected via an AlphaSeq assay that can be used in the development and benchmarking of machine learning models. Starting from three seed sequences identified from a phage display campaign using a human naïve library, four sets of 29,900 antibodies were designed in silico by creating all k = 1 mutations and random k = 2 and k = 3 mutations throughout the complementary-determining regions (CDRs). Of the 119,600 designs, 104,972 were successfully built in to the AlphaSeq library and target binding was subsequently measured with 71,384 designs resulting in a predicted affinity value for at least one of the triplicate measurements. Data include antibodies with predicted affinity measurements ranging from 37 pM to 22 mM. To our knowledge, this dataset is the largest, publicly available dataset that contains antibody sequences, antigen sequence and quantitative measurements of binding scores and provides an opportunity to serve as a benchmark to evaluate antibody-specific representation models for machine learning.
Therapeutic antibodies are an important and rapidly growing drug modality. However, the design and discovery of early-stage antibody therapeutics remain a time and cost-intensive endeavor. Here we present an end-to-end Bayesian, language model-based method for designing large and diverse libraries of high-affinity single-chain variable fragments (scFvs) that are then empirically measured. In a head-to-head comparison with a directed evolution approach, we show that the best scFv generated from our method represents a 28.7-fold improvement in binding over the best scFv from the directed evolution. Additionally, 99% of designed scFvs in our most successful library are improvements over the initial candidate scFv. By comparing a library’s predicted success to actual measurements, we demonstrate our method’s ability to explore tradeoffs between library success and diversity. Results of our work highlight the significant impact machine learning models can have on scFv development. We expect our method to be broadly applicable and provide value to other protein engineering tasks.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.