Benchmarking is a process in which standardized tests are used to assess system performance. The data produced by this process are important for comparative purposes, particularly when considering the implementation and quality assurance of DIR algorithms. In this work, five commercial DIR algorithms (MIM, Velocity, RayStation, Pinnacle, and Eclipse) were benchmarked using a set of 10 virtual phantoms. The phantoms were previously developed based on CT data collected from real head and neck patients. Each phantom includes a start-of-treatment CT dataset, an end-of-treatment CT dataset, and the ground-truth deformation vector field (DVF) that links them together. These virtual phantoms were imported into the commercial systems and deformably registered. The resulting DVFs were compared to the ground-truth DVF to determine the target registration error (TRE) at every voxel within the image set. Real treatment plans were also recalculated on each end-of-treatment CT dataset, and the dose was transferred according to both the ground-truth and test DVFs. Dosimetric changes were assessed, and TRE was correlated with changes in the dose-volume histogram (DVH) of individual structures. In the first part of the study, results show mean TRE on the order of 0.5 mm to 3 mm for all phantoms and regions of interest (ROIs). In certain instances, however, misregistrations were encountered that produced mean and maximum errors of up to 6.8 mm and 22 mm, respectively. In the second part of the study, dosimetric error was found to be strongly correlated with TRE in the brainstem, but weakly correlated with TRE in the spinal cord. Several interesting cases were assessed which highlight the interplay between the direction and magnitude of TRE and the dose distribution, including the slope of dosimetric gradients and the distance to critical structures. This information can be used to help clinicians better implement and test their algorithms, and also to understand the strengths and weaknesses of a dose-adaptive approach.

PACS number(s): 87.57.nj, 87.55.dk, 87.55.Qr
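
The voxel-wise TRE described above is the Euclidean distance between corresponding test and ground-truth displacement vectors. A minimal sketch of that computation is given below, assuming the DVFs are available as NumPy arrays of displacements in millimeters; the array shapes, function names, and the ROI mask are illustrative assumptions and are not drawn from any of the benchmarked systems.

```python
# Hedged sketch: voxel-wise target registration error (TRE) between a test DVF
# and a ground-truth DVF, both assumed to be NumPy arrays of shape (Z, Y, X, 3)
# holding displacement components in millimeters.
import numpy as np

def voxelwise_tre(test_dvf: np.ndarray, truth_dvf: np.ndarray) -> np.ndarray:
    """Euclidean distance (mm) between the two displacement vectors at
    every voxel: TRE = ||v_test - v_truth||."""
    return np.linalg.norm(test_dvf - truth_dvf, axis=-1)

def roi_tre_stats(tre: np.ndarray, roi_mask: np.ndarray) -> dict:
    """Summarize TRE inside a boolean ROI mask (same spatial shape as tre)."""
    values = tre[roi_mask]
    return {"mean_mm": float(values.mean()), "max_mm": float(values.max())}

# Example with synthetic data (a uniform 2 mm discrepancy in x):
if __name__ == "__main__":
    truth = np.zeros((10, 64, 64, 3))
    test = truth.copy()
    test[..., 0] += 2.0
    tre = voxelwise_tre(test, truth)
    mask = np.zeros(tre.shape, dtype=bool)
    mask[:, 20:40, 20:40] = True          # stand-in ROI, e.g., brainstem
    print(roi_tre_stats(tre, mask))       # {'mean_mm': 2.0, 'max_mm': 2.0}
```

Mean and maximum TRE per ROI, computed in this manner, correspond to the summary statistics reported in the first part of the study.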