The aim of this work is to provide insight into the multi-metric clinical validation of deformable image registration and contour propagation methods in 4D lung radiotherapy planning. The following indices were analyzed and compared: Volume Difference (VD), Dice Similarity Coefficient (DSC), Positive Predictive Value (PPV) and Surface Distances (SD). The analysis was performed on three patient datasets, using as reference a ground-truth volume generated with the Simultaneous Truth And Performance Level Estimation (STAPLE) algorithm from the outlines of five experts. Significant discrepancies were found in the quality assessment provided by the different metrics in all the examined cases. Metric sensitivity was more evident in the presence of image artifacts, and particularly for tubular anatomical structures such as the esophagus or spinal cord. VD did not account for position, and DSC showed critical limitations due to its intrinsic symmetry (i.e. over- and under-estimation of the reference contours cannot be discriminated) and its dependency on the total volume of the structure. PPV analysis showed more robust performance, as each voxel contributes to the classification of the propagation, but was not able to detect inclusion of propagated and ground-truth volumes. Mesh distances could capture the actual shape of the structures, but might report higher mismatches in case of large local differences in the contour surfaces. According to our study, the combination of VD and SD for the validation of contour propagation algorithms in 4D could provide the necessary failure detection accuracy.
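
The behaviour of the overlap-based indices described above can be illustrated with a minimal sketch on toy binary masks. This is not the implementation used in the study: the function name `overlap_metrics`, the one-dimensional toy masks, and the choice of a signed relative volume difference for VD are assumptions made for illustration, while DSC and PPV follow their standard definitions. Surface distances are omitted because they require mesh or distance-transform machinery.

```python
import numpy as np

def overlap_metrics(propagated, reference):
    """Return (VD, DSC, PPV) for two boolean masks of equal shape.

    VD is taken here as the signed relative volume difference
    (an assumed definition, not necessarily the one used in the study).
    """
    p = np.asarray(propagated, dtype=bool)
    r = np.asarray(reference, dtype=bool)
    v_p, v_r = int(p.sum()), int(r.sum())
    inter = int(np.logical_and(p, r).sum())
    vd = (v_p - v_r) / v_r            # volume only, blind to position
    dsc = 2.0 * inter / (v_p + v_r)   # symmetric in the two masks
    ppv = inter / v_p                 # fraction of propagated voxels inside the reference
    return vd, dsc, ppv

def make_mask(start, stop, size=40):
    """Hypothetical helper: 1D mask with voxels start..stop-1 set."""
    m = np.zeros(size, dtype=bool)
    m[start:stop] = True
    return m

reference = make_mask(10, 20)   # 10 voxels
shifted = make_mask(15, 25)     # same volume, displaced by 5 voxels
inside = make_mask(12, 17)      # under-estimation, fully inside the reference
outside = make_mask(5, 25)      # over-estimation, fully enclosing the reference

for name, mask in [("shifted", shifted), ("inside", inside), ("outside", outside)]:
    vd, dsc, ppv = overlap_metrics(mask, reference)
    print(f"{name:8s} VD={vd:+.2f} DSC={dsc:.3f} PPV={ppv:.2f}")
```

In this toy case the shifted mask has a VD of zero despite a clear displacement, the under- and over-estimated masks share the same DSC, and the mask lying inside the reference reaches a PPV of 1. These are the three limitations noted in the text.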