We developed new criteria for determining the library size in a saturation mutagenesis experiment. When the number of all possible distinct variants is large, any of the top-performing variants (e.g., any of the top three) is likely to meet the design requirements, so the probability that the library contains at least one of them is a sensible criterion for determining the library size. By using a criterion of this type, one may significantly reduce the library size and thus save costs and labor while minimally compromising the quality of the best variant discovered. We present the probabilistic tools underlying these criteria and use them to compare the efficiencies of four randomization schemes: NNN, which uses all 64 codons; NNB, which uses 48 codons; NNK, which uses 32 codons; and MAX, which assigns equal probabilities to each of the 20 amino acids. MAX was found to be the most efficient randomization scheme and NNN the least efficient. TopLib, a computer program for carrying out the related calculations, is available through a user-friendly Web server.
Saturation mutagenesis (also called oligonucleotide-directed randomization) is a protein-engineering technique that has been used widely and successfully to improve protein properties such as catalytic activity, enantioselectivity, thermostability, and binding affinity (3, 12, 14, 16). We use the term "activity" for the protein's property under optimization, but the methodology developed below applies to any desirable protein feature that may be influenced by mutation.
In saturation mutagenesis, one or more positions along the protein sequence are identified as likely to accommodate beneficial mutations and are then randomized, i.e., the amino acids at these positions are replaced by random ones. The randomization originates at the DNA level, typically via degenerate primers containing a mixture of sequences at the chosen codons. To decrease the chances of introducing a premature stop codon, reduced codon sets are often used: NNB, NNS, and NNK codons (where N = A/C/G/T, B = C/G/T, S = C/G, and K = G/T) are popular choices that still encode all 20 amino acids, but the use of codon sets encoding fewer amino acids has also been advocated (9, 15). More sophisticated randomization schemes, such as MAX (6, 7), result in equal probabilities for all 20 amino acids (or for some predetermined subset thereof) without encoding stop codons. Either way, a large number of random variants, which together constitute a library, are produced and then screened in an attempt to discover a highly active variant among them. Clearly, the larger the library, the higher the probability of exploring more distinct variants. We shall use the term "variant space" to denote the set of all possible distinct variants in a given experiment; this space is determined by the number of positions randomized and the randomization scheme.
The probabilistic literature on saturation mutagenesis (1, 5, 8, 11) focuses mainly on two mathematical quantities: the first is the expected percentage of variant spa...