In recent years, machine learning (ML) models have been widely developed for building systems. For example, a number of ML models have been developed to predict the load demand of a building. Current ML models commonly report snapshot accuracy only, so practitioners have difficulty understanding how a model behaves in usage, i.e., how model accuracy may change while the model is in use. This raises concerns about ML-model deployment. In this paper, we propose BuildChecks, a behavior testing methodology to systematically evaluate building load forecasting ML models in usage. The challenge of developing such a methodology is to specify "what to evaluate", i.e., given a certain building load forecasting model, which tests should be applied to it. We categorize building load forecasting models into three model types and propose three in-usage concerns. Our methodology specifies the tests, i.e., for each model type, the in-usage concerns that should be tested. We develop an open-source BuildChecks platform to materialize our behavior testing methodology. The BuildChecks platform integrates the testing algorithms and four default real-world building datasets. We use BuildChecks to test the behaviors of two existing load forecasting models. As an example, while an ML model has high accuracy across all buildings, BuildChecks reports that in one building this ML model has a cold-start time of 45 days, yet in another building the cold-start time is roughly three-fold greater, 141 days, which can delay model usage.
CCS CONCEPTS
• General and reference → Evaluation.