To deal with the huge number of novel protein-coding variants being identified by genome and exome sequencing studies, many computational phenotype predictors have been developed. Unfortunately, such predictors are often trained and evaluated on different protein variant datasets, making direct comparison between predictors very difficult. Moreover, training and testing datasets may overlap, introducing training bias. In this study, we use 29 previously published deep mutational scanning (DMS) experiments, which provide quantitative, unbiased phenotypic measurements for large numbers of single amino acid substitutions, to benchmark and compare 31 different computational phenotype predictors. We also evaluate the ability of DMS measurements and computational phenotype predictors to discriminate between pathogenic and benign missense variants. We find that DMS experiments based upon competitive growth assays tend to be superior to the top-ranking computational predictors at this task, demonstrating the tremendous potential of DMS for identifying novel human disease mutations. Among the computational phenotype predictors, DeepSequence clearly stands out, showing both the strongest correlations with DMS data and the best ability to predict pathogenic mutations, which is especially remarkable given that it has not been trained against human mutations. Other predictors that we recommend, having shown good results when tested against both DMS data and human mutations, include SNAP2, SNPs&GO, DEOGEN2, VEST4 and REVEL; they also benefit from being much easier for end users to run than DeepSequence.