Machine learning (ML) is transforming computational protein design, with data-driven methods surpassing biophysics-based methods in experimental success rates. However, ML methods are most often reported as case studies and lack integration and standardization across platforms, which makes them hard to compare objectively. In this study, we established a streamlined and diverse toolbox of methods that predict amino acid probabilities within the Rosetta software framework, allowing side-by-side comparison of these models. We then used existing protein fitness landscapes to benchmark novel self-supervised ML methods in realistic protein design settings. We focused on the traditional problems of protein sequence design: sampling and scoring. A major finding of our study is that novel ML approaches are better at purging deleterious mutations from the sampling space. Nevertheless, scoring the resulting mutations without model fine-tuning showed no clear improvement over scoring with Rosetta. This study fills an important gap in the field and allows, for the first time, a comprehensive head-to-head comparison of different ML and biophysical methods. We conclude that ML currently acts as a complement to, rather than a replacement for, biophysical methods in protein design.