Daily flows of the River Malaba recorded from 1999 to 2016 were modelled using seven lumped conceptual rainfall–runoff models: AWBM, SACRAMENTO, TANK, IHACRES, SIMHYD, SMAR and HMSV. Optimal parameters of each model were obtained using an automatic calibration strategy. Mismatches between observed and modelled flows were assessed using nine “goodness-of-fit” metrics. The capacity of the models to reproduce historical hydrological extremes was assessed by comparing amplitude–duration–frequency (ADF) curves constructed from observed and modelled flow quantiles. Most of the models performed better for high flows than for low flows. ADF curves of both high and low flows for return periods from 5 to 100 years were well reproduced by AWBM, SACRAMENTO, TANK and HMSV. ADF curves for high flows were poorly reproduced by SIMHYD, and those for low flows by SMAR. Overall, AWBM performed slightly better than the other models when high and low flows are considered together. Model deviations were larger for high than for low return periods. The choice of “goodness-of-fit” metric was found to affect how model performance is judged. The results also show that, when focusing on hydrological extremes, uncertainty due to the choice of a particular model should be taken into consideration. These insights provide relevant information for planning risk-based water resources applications.
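The following is a minimal sketch of how the choice of a “goodness-of-fit” metric can change the verdict on a model. The nine metrics used in the study are not listed here, so the Nash-Sutcliffe efficiency (NSE) and its log-transformed variant are assumed purely for illustration. The flow values are made up and are not River Malaba data, and the function names and the `eps` offset are hypothetical choices.

```python
import numpy as np

def nse(observed, modelled):
    """Nash-Sutcliffe efficiency: 1 is a perfect fit; values <= 0 mean
    the model is no better than the mean of the observations."""
    obs = np.asarray(observed, dtype=float)
    sim = np.asarray(modelled, dtype=float)
    return 1.0 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

def log_nse(observed, modelled, eps=0.01):
    """NSE computed on log-transformed flows, which gives low flows more weight.
    eps is an assumed small offset that guards against log(0)."""
    obs = np.log(np.asarray(observed, dtype=float) + eps)
    sim = np.log(np.asarray(modelled, dtype=float) + eps)
    return nse(obs, sim)

# Illustrative daily flows in m^3/s (synthetic, not observed data).
observed = np.array([0.5, 0.6, 0.8, 5.0, 40.0, 25.0, 8.0, 2.0, 1.0, 0.7])
# Hypothetical model output: tracks the flood peak closely but
# overestimates the low flows by about 1 m^3/s.
modelled = np.array([1.5, 1.6, 1.8, 5.5, 39.0, 24.0, 8.5, 3.0, 2.0, 1.7])

score_raw = nse(observed, modelled)
score_log = log_nse(observed, modelled)
print(f"NSE on raw flows: {score_raw:.3f}")
print(f"NSE on log flows: {score_log:.3f}")
```

In this synthetic case the raw-flow NSE is dominated by the flood peak and rates the model as nearly perfect, while the log-transformed NSE exposes the low-flow bias. This mirrors the general point that a model can look good for high flows and poor for low flows depending on the metric applied.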