Using a high‐quality dataset from 12 flux towers in north China, this study systematically evaluated the performance of four evapotranspiration (ET) models and of two multi‐model ensemble approaches, simple averaging (SA) and Bayesian model averaging (BMA). The four models were the single‐layer Penman–Monteith (P–M) model, the two‐layer Shuttleworth–Wallace (S–W) model, the advection–aridity (A–A) model, and a modified Priestley–Taylor model (PT‐JPL). Based on the mean Taylor skill (S) and the regression slope between measured and simulated ET values across all sites, the overall performance of the individual models, ranked from best to worst, was: S–W (0.88, 0.87), PT‐JPL (0.80, 1.17), P–M (0.63, 1.73) and A–A (0.60, 1.68) [statistics stated as (Taylor skill, regression slope)]. All models used the same values of parameters, leaf area index (LAI) and fractional vegetation cover, as well as the same forcing meteorological data. Thus, the differences in model performance were mainly attributed to errors in model structure. Of the two ensemble approaches, the BMA method has the advantage of generating more skillful and reliable predictions than the SA scheme. However, successful implementation of BMA requires accurate estimates of its parameters, and some degradation in performance was observed when the BMA parameters generated from the training period were used for the validation period. Thus, it is necessary to explore the seasonal variations of the BMA parameters according to the different growth stages. Finally, the optimal conditional density function of half‐hourly ET was approximated well by the double‐exponential distribution. Copyright © 2016 John Wiley & Sons, Ltd.