The design of methods for performance evaluation is a major open research issue in the area of spoken language dialogue systems. In this paper we present the PARADISE methodology for developing predictive models of spoken dialogue performance, and then show how to evaluate the predictive power and generalizability of such models. To illustrate our methodology, we develop a number of models for predicting system usability (as measured by user satisfaction), based on the application of PARADISE to experimental data from three different spoken dialogue systems. We then measure the extent to which our models generalize across different systems, different experimental conditions, and different user populations, by testing models trained on a subset of our corpus against a test set of dialogues. Our results show that our models generalize well across our three systems, and are thus a first approximation towards a general performance model of system usability.
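As a minimal illustration of the train/test methodology sketched above, and not the paper's actual PARADISE model or data, the following Python sketch fits a linear model predicting user satisfaction from dialogue features on a training subset and reports how well it generalizes to a held-out test set. The feature values and coefficients are synthetic placeholders assumed only for this example.

    import numpy as np

    # Hypothetical stand-in data: each row is one dialogue, columns are
    # normalized predictors (e.g., task success and dialogue cost measures);
    # the target is the user satisfaction score for that dialogue.
    rng = np.random.default_rng(0)
    true_weights = np.array([0.6, -0.3, -0.1])
    n_train, n_test = 120, 40
    X_train = rng.normal(size=(n_train, 3))
    y_train = X_train @ true_weights + rng.normal(scale=0.5, size=n_train)
    X_test = rng.normal(size=(n_test, 3))
    y_test = X_test @ true_weights + rng.normal(scale=0.5, size=n_test)

    def fit_linear_model(X, y):
        # Ordinary least squares with an intercept term.
        X1 = np.column_stack([np.ones(len(X)), X])
        coef, *_ = np.linalg.lstsq(X1, y, rcond=None)
        return coef

    def r_squared(coef, X, y):
        # Proportion of variance in user satisfaction explained by the model.
        X1 = np.column_stack([np.ones(len(X)), X])
        residuals = y - X1 @ coef
        return 1.0 - residuals.var() / y.var()

    coef = fit_linear_model(X_train, y_train)
    print("train R^2:", round(r_squared(coef, X_train, y_train), 3))
    print("test  R^2:", round(r_squared(coef, X_test, y_test), 3))

Comparing the R-squared on the training subset with that on the held-out dialogues mirrors, in miniature, the generalization test described in the abstract: a model that predicts well only on the data it was trained on would not support a general performance model.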