Irwin tests are key elements of preclinical studies for characterizing drug-induced neurological side effects. This multicenter study aimed to assess the robustness of Irwin tests across multinational sites during three stages of protocol harmonization. The study was part of the EQIPD framework (Enhanced Quality in Preclinical Data, https://quality-preclinical-data.eu/), which aims to increase success rates in the transition from preclinical testing to clinical application. Female and male NMRI mice were assigned to one of three groups (vehicle, 0.1 mg/kg MK-801, 0.3 mg/kg MK-801). Irwin scores were assessed at baseline and at multiple time points following injection of MK-801, a non-competitive NMDA receptor antagonist, using local protocols (stage 1), a shared protocol with harmonized environmental design (stage 2), and fully harmonized Irwin scoring protocols (stage 3). The analysis, based on four functional domains (motor, autonomic, sedation, and excitation), revealed substantial data variability in stages 1 and 2. Although marked overall heterogeneity between sites remained in stage 3 after complete harmonization of the Irwin scoring scheme, heterogeneity within functional domains was only moderate. When comparing treatment groups with vehicle, we found large effect sizes in the motor domain and subtle to moderate effects in the excitation-related and autonomic domains. The pronounced interlaboratory variability in Irwin datasets for the CNS-active compound MK-801 needs to be carefully considered by companies and experimenters when making decisions during drug development. While environmental and general study design had only a minor impact, the study suggests that harmonizing the parameters and their scoring can limit variability and increase robustness.