Modelling gully erosion in urban areas is challenging due to difficulties with equifinality and parameter identification, which complicates quantification of management impacts on runoff and sediment production. We calibrated a model (AnnAGNPS) of an ephemeral gully network that formed on unpaved roads following a storm event in an urban watershed (0.2 km2) in Tijuana, Mexico. Latin hypercube sampling was used to create 500 parameter ensembles. Modelled sediment load was most sensitive to the Soil Conservation Service (SCS) curve number, tillage depth (Td), and critical shear stress (τc). Twenty-one parameter ensembles gave acceptable error (behavioural models), though changes in parameters governing runoff generation (SCS curve number, Manning’s n) were compensated by changes in parameters describing soil properties (TD, τc, resulting in uncertainty in the optimal parameter values. The most suitable parameter combinations or “behavioural models” were used to evaluate uncertainty under management scenarios. Paving the roads increased runoff by 146–227%, increased peak discharge by 178–575%, and decreased sediment load by 90–94% depending on the ensemble. The method can be used in other watersheds to simulate runoff and gully erosion, to quantify the uncertainty of model-estimated impacts of management activities on runoff and erosion, and to suggest critical field measurements to reduce uncertainties in complex urban environments.