Empirical models, previously called land-use regression (LUR) models, are used to understand and predict spatial variability in outdoor air pollution levels at unmeasured locations, for example, to conduct health risk assessment, environmental epidemiology, or environmental justice analysis. Many methods are used to generate empirical models, yet almost no research compares models generated by separate research groups. We intercompare six national-scale empirical models for year-2010 concentrations of PM2.5 in the US, each generated by a different research group. Despite substantial differences in the statistical methods and input data used to build the models, our main finding is a relatively high degree of agreement among model predictions. For example, in pairwise intercomparisons, the average Pearson correlation coefficient is 0.87 (range: 0.84 to 0.92); the root-mean-square difference (RMSD; units: μg/m³) is 1.1 on average (range: 0.8 to 1.4), or ~12% of the average concentration; and many best-fit lines are near the 1:1 line. The likely reason for this agreement is that, although the methods and independent variables differ among the models, all of the models are built using, and calibrated to, the same information: publicly available measurements at US EPA regulatory monitoring stations. These findings suggest that future improvements to national empirical models will come not from further refinements to the methods (e.g., more advanced models) but from employing a fundamentally different set of observations in addition to regulatory monitoring data.
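
To illustrate the pairwise metrics reported above, the following minimal Python sketch computes the Pearson correlation coefficient and the RMSD for every pair of models, assuming each model's predictions are available as an array evaluated at the same locations in the same order. The function name `pairwise_agreement`, the model names, and the synthetic data are assumptions made for illustration; they are not the study's code, data, or results.

```python
from itertools import combinations

import numpy as np


def pairwise_agreement(predictions):
    """Return {(name_a, name_b): (pearson_r, rmsd)} for every pair of models.

    `predictions` maps a (hypothetical) model name to a 1-D array of
    concentrations in ug/m3, all evaluated at the same locations.
    """
    results = {}
    for a, b in combinations(sorted(predictions), 2):
        x = np.asarray(predictions[a], dtype=float)
        y = np.asarray(predictions[b], dtype=float)
        r = np.corrcoef(x, y)[0, 1]             # Pearson correlation coefficient
        rmsd = np.sqrt(np.mean((x - y) ** 2))   # root-mean-square difference
        results[(a, b)] = (r, rmsd)
    return results


# Synthetic placeholder data (NOT values from the study): three models that
# share a common underlying field plus independent noise.
rng = np.random.default_rng(0)
field = rng.uniform(4.0, 14.0, size=1000)
models = {
    f"model_{i}": field + rng.normal(0.0, 0.8, size=field.size)
    for i in range(1, 4)
}

stats = pairwise_agreement(models)
mean_r = np.mean([r for r, _ in stats.values()])
mean_rmsd = np.mean([d for _, d in stats.values()])
mean_conc = np.mean([np.mean(v) for v in models.values()])

print(f"mean Pearson r: {mean_r:.2f}")
print(f"mean RMSD: {mean_rmsd:.2f} ug/m3 "
      f"({100 * mean_rmsd / mean_conc:.0f}% of mean concentration)")
```

With six models, the same function would return 15 pairs. The last line expresses the mean RMSD as a fraction of the mean concentration, in the same way the ~12% figure above is expressed.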