Abstract. Since their first operational application in the 1950s, atmospheric numerical models have become essential tools in weather and climate prediction. As such, they are subject to continuous change, driven by advances in computer systems, numerical methods, more and better observations, and the ever-increasing knowledge about Earth's atmosphere. Many of the changes in today's models relate to seemingly innocuous modifications, such as minor code rearrangements, changes in hardware infrastructure, or software updates. Such changes are not supposed to significantly affect the model. However, this is difficult to verify, because our atmosphere is a chaotic system in which even a tiny change can have a large impact on individual simulations. Overall, this represents a serious challenge for consistent model development and maintenance. Here we propose a new methodology for quantifying and verifying the impact of minor changes to an atmospheric model, or to its underlying hardware/software system, using a set of simulations with slightly different initial conditions in combination with a statistical hypothesis test. The methodology can assess the effect of model changes on almost any output variable over time and can be used with different underlying statistical hypothesis tests. We present first applications of the methodology with a regional weather and climate model, including the verification of a major system update of the underlying supercomputer. While providing very robust results, the methodology is highly sensitive even to tiny changes. Results show that changes are often detectable only during the first hours of a simulation, which suggests that short simulations (days to months) are best suited for the methodology, even when addressing long-term climate simulations. We also show that the choice of the underlying statistical hypothesis test is not critical and that the methodology already works well at coarse resolutions, making it computationally inexpensive and therefore an ideal candidate for automated testing.
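
The abstract names only the ingredients of the methodology (an ensemble of simulations started from slightly different initial conditions, and a statistical hypothesis test applied per output variable and output time), so the following is a minimal sketch of that general idea, not the paper's implementation. The synthetic data, the two-sample Kolmogorov-Smirnov test, the ensemble size, and the 5 % significance level are illustrative assumptions.

    # Minimal sketch: compare two ensembles of one output variable, output
    # time by output time, with a two-sample hypothesis test. All numbers
    # below are illustrative placeholders, not values from the paper.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(seed=0)
    n_members, n_times = 20, 48     # ensemble members, hourly output times
    alpha = 0.05                    # significance level (assumed)

    # Stand-ins for one output variable (e.g. 2 m temperature, in K) from the
    # reference ensemble and from the ensemble run on the modified system.
    ref = rng.normal(loc=285.0, scale=1.0, size=(n_members, n_times))
    mod = rng.normal(loc=285.0, scale=1.0, size=(n_members, n_times))
    mod[:, :6] += 0.5               # synthetic difference in the first 6 output times

    # Test the two ensemble distributions against each other at every output time.
    p_values = np.array([ks_2samp(ref[:, t], mod[:, t]).pvalue
                         for t in range(n_times)])
    rejected = p_values < alpha
    print(f"null hypothesis rejected at {rejected.sum()} of {n_times} output times")

In practice such a test would be repeated over many output variables (and possibly grid points), with some account taken of multiple testing; those details are not specified in the abstract.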