Understanding the relationship between protein sequence and function is crucial for accurate genetic variant classification. Variant effect predictors (VEPs) play a vital role in deciphering this complex relationship, yet evaluating their performance remains challenging due to data circularity, where the same or related data is used for training and assessment. High-throughput experimental strategies like deep mutational scanning (DMS) offer a promising solution. In this study, we extend upon our previous benchmarking approach, assessing the performance of 84 different VEPs and DMS experiments from 36 different human proteins. In addition, a new pairwise, VEP-centric ranking method reduces the impact of VEP score availability on the overall ranking. We observe a remarkably high correspondence between VEP performance in DMS-based benchmarks and clinical variant classification, especially for predictors that have not been directly trained on human clinical variants. Our results suggest that comparing VEP performance against diverse functional assays represents a reliable strategy for assessing their relative performance in clinical variant classification. However, major challenges in clinical interpretation of VEP scores persist, highlighting the need for further research to fully leverage computational predictors for genetic diagnosis. We also address practical considerations for end users in terms of choice of methodology.