In recent years, several short-term forecasting models of household electricity demand have been proposed in the literature. This is partly due to emerging smart-grid applications, which require such forecasts to manage systems such as smart homes and prosumer aggregations, and partly thanks to the availability of data from smart meters, which enable the development of such models. Since most models are developed in an academic setting, they often do not address the challenges of implementation in a real-world environment. There, several issues related to data quality and availability arise, which affect the operational performance and robustness of a forecasting system. In this paper, we design a hierarchical forecasting framework based on five probabilistic models of varying complexity, after analyzing the respective performance and advantages of the models on an offline dataset. This multi-layered framework is necessary to address the various problematic situations occurring in practice and to meet the requirements of a real-world deployment. The forecasting system is deployed in a real-world case and evaluated here on data from 20 households. Field data, comprising forecasts and measurements, are analyzed for each household. A detailed comparison is drawn between online and offline performance. Since a notable degradation is observed in the operational environment, we discuss the reasons for this effect at length. We determine that the exact settings of the training and test periods are only marginally responsible, and that the main cause is the intrinsic evolution of the demand time series, which hinders forecasting performance. This evolution is due to unknown household characteristics that need to be monitored to provide more adaptable models.