Recent interest has surged in building large-scale foundation models for medical applications. In this paper, we propose a general framework for evaluating the efficacy of these foundation models in medicine, suggesting that they should be assessed across three dimensions: general performance, bias/fairness, and the influence of confounders. Utilizing Google's recently released dermatology embedding model and lesion diagnostics as examples, we demonstrate that: 1) dermatology foundation models surpass state-of-the-art classification accuracy; 2) general-purpose CLIP models encode features informative for medical applications and should be more broadly considered as a baseline; 3) skin tone is a key differentiator for performance, and the potential bias associated with it needs to be quantified, monitored, and communicated; and 4) image quality significantly impacts model performance, necessitating that evaluation results across different datasets control for this variable. Our findings provide a nuanced view of the utility and limitations of large-scale foundation models for medical AI.