Accurately predicting weather and climate in cities is critical for safeguarding human health and strengthening urban resilience. Multi‐model evaluations can lead to model improvements; however, there have been no major intercomparisons of urban‐focused land surface models in over a decade. Here, in Phase 1 of the Urban‐PLUMBER project, we evaluate the ability of 30 land surface models to simulate surface energy fluxes critical to atmospheric meteorological and air quality simulations. We establish minimum and upper performance expectations for participating models using simple information‐limited models as benchmarks. Compared with the last major model intercomparison at the same site, we find broad improvement in the current cohort's predictions of shortwave radiation, sensible heat flux, and latent heat flux, but little or no improvement in longwave radiation and momentum fluxes. Models with a simple urban representation (e.g. “slab” schemes) generally perform well, particularly when combined with sophisticated hydrological/vegetation models. Some mid‐complexity models (e.g. “canyon” schemes) also perform well, indicating that efforts to integrate vegetation and hydrology processes have paid dividends. The most complex models, which resolve three‐dimensional interactions between buildings, generally did not perform as well as other categories. However, these models also tended to have the simplest representations of hydrology and vegetation. Models without any urban representation (i.e. vegetation‐only land surface models) performed poorly for latent heat fluxes and reasonably well for other energy fluxes at this suburban site. Our analysis identified widespread human errors in initial submissions that substantially affected model performance.
Although significant effort was applied to correct these errors, we conclude that human factors are likely to influence results in this (or any) model intercomparison, particularly where participating scientists have varying experience and first languages. These initial results are for one suburban site; future phases of Urban‐PLUMBER will evaluate models across twenty sites in different urban and regional climate zones.