Background
Models that predict the gamma passing rate (GPR) have been studied as substitutes for measurement-based gamma analysis. Because these studies used data from different radiotherapy systems, each comprising a TPS, a linear accelerator, and a detector array, it has been difficult to compare the performance of the prediction models among institutions with different radiotherapy systems.

Purpose
We aimed to develop unbiased scoring methods to evaluate the performance of GPR prediction models by introducing both best and worst limits for the performance of the GPR prediction.

Methods
Two hundred head-and-neck VMAT plans were used to develop the framework. The GPRs were measured using the ArcCHECK device. The predicted GPR [p] was generated using a deep learning-based model [pDL]. The prediction model was evaluated using four metrics: standard deviation (SD) [σ], Pearson's correlation coefficient (CC) [r], mean squared error (MSE) [s], and mean absolute error (MAE) [a]. The best limit of each metric was estimated from the SD of the measured GPR [m], obtained by shifting the device along the longitudinal direction so that different sampling points were measured. Mimicked best and worst p values [pbest and pworst] were generated from pDL. The worst limit was defined as the condition in which m and p have no correlation [CC ∼ 0]. The worst limit [σMix, rMix, sMix, and aMix] was generated using the event-mixing (EM) technique originally introduced in high-energy physics experiments. The ranges of σ, r, s, and a were each bounded by the corresponding best and worst limits. Achievement scores (AS) based independently on σ, r, s, and a were calculated for pDL, pbest, and pworst. The probability that p fails the gamma analysis (alert frequency; AF) was estimated as a function of σ values between the best limit and σMix for the 3%/2 mm data with a 95% criterion.

Results
SDs of the best limit were well reproduced by […]. The EM technique successfully generated pairs with no correlation. The AS values obtained with the four metrics showed good agreement.
This agreement indicates that the best and worst limits were defined successfully, that the AS definitions are consistent, and that the mixed events were generated successfully. The AF for the DL-based model with the 3%/2 mm tolerance was 31.5% and 63.0% at confidence levels (CLs) of 99% and 99.9%, respectively.

Conclusion
We developed the AS to evaluate GPR prediction models in an unbiased manner by excluding the effects of the precision of the radiotherapy system and the spread of the GPR. The best and worst limits of the GPR prediction were successfully generated using the measured precision of the GPR and the EM technique, respectively. The AS and AF are expected to enable objective evaluation of prediction models and the setting of an exact achievement goal for the precision of the predicted GPR.
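
As an illustration only, the four metrics and an event-mixing style worst limit could be sketched as follows. This is not the authors' implementation. Several points are assumptions: σ is taken here as the SD of the residual p − m, the mixing is done by re-pairing each measured GPR with the prediction from a different plan, and the arrays hold synthetic numbers rather than data from the study. The names `gpr_metrics` and `mix_events` are hypothetical.

```python
import numpy as np


def gpr_metrics(m, p):
    """Return (sigma, r, s, a) for measured GPRs m and predicted GPRs p."""
    m = np.asarray(m, dtype=float)
    p = np.asarray(p, dtype=float)
    d = p - m
    sigma = d.std(ddof=1)          # assumption: SD of the residual p - m
    r = np.corrcoef(m, p)[0, 1]    # Pearson's correlation coefficient
    s = np.mean(d ** 2)            # mean squared error
    a = np.mean(np.abs(d))         # mean absolute error
    return sigma, r, s, a


def mix_events(n, rng):
    """Return partner indices so that no plan is paired with itself.

    A random ordering of the plans is closed into a single cycle, so every
    plan receives the prediction of a different plan (requires n >= 2).
    """
    order = rng.permutation(n)
    partner = np.empty(n, dtype=int)
    partner[order] = np.roll(order, 1)
    return partner


# Synthetic stand-in data (NOT from the study): 200 plans.
rng = np.random.default_rng(0)
m = np.clip(rng.normal(93.0, 3.0, 200), None, 100.0)   # "measured" GPR in %
p = m + rng.normal(0.0, 1.0, 200)                      # "predicted" GPR in %

partner = mix_events(len(m), rng)
paired = gpr_metrics(m, p)             # metrics of the model as given
mixed = gpr_metrics(m, p[partner])     # uncorrelated pairs, worst-limit style

print("paired (sigma, r, s, a):", np.round(paired, 3))
print("mixed  (sigma, r, s, a):", np.round(mixed, 3))
```

In this sketch the mixed pairs keep the marginal distributions of m and p but lose their plan-by-plan link, so the correlation coefficient falls close to zero while the error metrics grow. That is the behaviour the worst limit is meant to capture.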