This study aimed to assess the test-retest reliability (TRR) of the ‘Welfare Quality® animal welfare assessment protocol for sows and piglets’, focusing on the welfare principle ‘appropriate behavior’. TRR was evaluated using Spearman’s rank correlation coefficient (RS), the intraclass correlation coefficient (ICC), the smallest detectable change (SDC), and the limits of agreement (LoA). Principal component analysis (PCA) was used for a more detailed analysis of the Qualitative Behavior Assessment (QBA). The study was conducted on thirteen farms in Northern Germany, each of which was visited five times by the same observer. Farm visit 1 (F1; day 0) was compared with farm visits 2 to 5 (F2–F5). The QBA showed no TRR under the statistical parameters listed above (e.g., ‘playful’ (F1–F4): RS 0.08, ICC 0.00, SDC 0.50, LoA [−0.62, 0.38]). The PCA yielded contradictory results regarding TRR. Acceptable TRR was found for parts of the instantaneous scan sampling (e.g., negative social behavior (F1–F3): RS 0.45, ICC 0.37, SDC 0.02, LoA [−0.03, 0.02]). The human-animal relationship test achieved only poor TRR, whereas scans for stereotypies showed sufficient TRR (e.g., floor licking (F1–F4): RS 0.63, ICC 0.52, SDC 0.05, LoA [−0.08, 0.04]). In conclusion, the principle ‘appropriate behavior’ does not demonstrate TRR, and further investigation is needed before on-farm implementation.
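
The four reliability parameters named above can be illustrated with a minimal Python sketch. It is not the study's analysis code, and several choices in it are assumptions: the ICC is computed as the two-way, absolute-agreement, single-measure form, ICC(2,1); the SDC is taken as 1.96 × the standard deviation of the paired differences (equivalent to 1.96 × √2 × SEM with SEM = SD of differences / √2); and differences are taken as first visit minus later visit. The abstract does not state which variants were used, although the reported ‘playful’ example (SDC 0.50, LoA half-width 0.50) is at least consistent with this SDC definition. The function names and the farm values are hypothetical.

```python
import math
from statistics import mean, stdev


def ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def icc_agreement(x, y):
    """ICC(2,1) for two visits (assumed form, see the text above)."""
    n, k = len(x), 2
    grand = mean(x + y)
    farm_means = [(a + b) / 2 for a, b in zip(x, y)]
    visit_means = [mean(x), mean(y)]
    ss_farms = k * sum((m - grand) ** 2 for m in farm_means)
    ss_visits = n * sum((m - grand) ** 2 for m in visit_means)
    ss_total = sum((v - grand) ** 2 for v in x + y)
    ss_error = ss_total - ss_farms - ss_visits
    msr = ss_farms / (n - 1)               # between-farm mean square
    msc = ss_visits / (k - 1)              # between-visit mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def sdc_and_loa(x, y):
    """Return (SDC, (lower LoA, upper LoA)) from the paired differences."""
    diffs = [a - b for a, b in zip(x, y)]  # assumed direction: first minus later visit
    mean_diff, sd_diff = mean(diffs), stdev(diffs)
    sdc = 1.96 * sd_diff                   # assumed SDC definition
    return sdc, (mean_diff - sdc, mean_diff + sdc)


# Hypothetical indicator values for five farms at two visits (not study data).
visit_1 = [0.10, 0.20, 0.30, 0.40, 0.50]
visit_4 = [0.12, 0.18, 0.33, 0.41, 0.47]

sdc, (loa_low, loa_high) = sdc_and_loa(visit_1, visit_4)
print("RS ", spearman(visit_1, visit_4))
print("ICC", icc_agreement(visit_1, visit_4))
print("SDC", sdc)
print("LoA", (loa_low, loa_high))
```

Under these assumptions the LoA are centered on the mean difference with a half-width equal to the SDC, so the two parameters carry closely related information; RS and ICC differ in that RS only reflects the rank order of farms, while this ICC form also penalizes systematic shifts between visits.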