The automatic detection of near-duplicate video segments, such as multiple takes of a scene or different news video clips showing the same event, has received growing research interest in recent years. However, there is no agreed-upon way of evaluating near-duplicate detection algorithms, which makes it very hard to compare the performance of different algorithms even when they are applied to the same data set. In this paper we implement several evaluation measures found in the literature and apply them to real algorithm outputs and a simulated result data set. We then calculate the correlation between the results obtained with the different measures in order to investigate whether they are comparable. The results show that the correlation between the measures is in some cases quite low, and that some measures are especially sensitive to certain types of deviation from the ground truth. However, a group of precision/recall-type measures and two other measures are clearly correlated, though with moderate correlation coefficients. We also analyze the correlation between these measures and subjective human judgments of the number of repeated segments in summary videos.
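As a minimal sketch of the kind of measure-to-measure comparison described above, the following illustrates correlating the scores that two evaluation measures assign to the same set of detection results. The measure names and score values are hypothetical, and Pearson's coefficient is assumed here as the correlation statistic; the paper itself does not prescribe these specifics.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical scores of two evaluation measures applied to the same
# five detection results (illustrative values, not from the paper).
measure_a = [0.91, 0.75, 0.60, 0.82, 0.40]
measure_b = [0.88, 0.70, 0.65, 0.79, 0.45]

r = pearson(measure_a, measure_b)
print(f"correlation between measure A and measure B: {r:.3f}")
```

A high coefficient would suggest the two measures rank algorithm outputs similarly, while a low one would indicate they capture different aspects of near-duplicate detection quality.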