We assess the performance of different break detection methods on three benchmark data sets, each consisting of 120 daily time series of integrated water vapor differences. These differences are computed between Global Positioning System (GPS) measurements at 120 sites worldwide and the integrated water vapor output of the numerical weather prediction reanalysis (ERA-Interim), which serves as the reference series here. The benchmark includes homogeneous and inhomogeneous sections, with nonclimatic shifts (breaks) added to the latter. The three variants of the benchmark time series increase in complexity: the first combines a white noise model with periodic behavior, the second adds autoregressive noise of the first order, and the third additionally introduces gaps and allows nonclimatic trends. The purpose of this third, "complex experiment" is to examine the performance of break detection methods in a more realistic case in which the reference series are not homogeneous. We evaluate the performance of the break detection methods with skill scores, centered root mean square errors (CRMSE), and trend differences relative to the trends of the homogeneous series. We find that most methods underestimate the number of breaks and produce a significant number of false detections. Despite this, the CRMSE reduction is significant (roughly between 40% and 80%) in the easy to moderate experiments, and the trend bias reduction even exceeds 90% of the raw data error. For the complex experiment, the improvement ranges between 15% and 35% with respect to the raw data, in terms of both CRMSE and trend estimates.
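The two error measures named above can be illustrated with a minimal Python sketch. It assumes the common definition of the CRMSE (the root mean square difference between two series after the mean of each has been removed) and an ordinary least-squares linear trend; the exact formulations used in the study may differ. The synthetic series, the AR(1) coefficient, the break size, and the helper names `crmse`, `linear_trend`, and `ar1_noise` are hypothetical and serve only as an illustration; they are not the benchmark data or results.

```python
import numpy as np


def crmse(candidate, truth):
    """Centered RMSE: RMS difference after removing each series' own mean."""
    candidate = np.asarray(candidate, dtype=float)
    truth = np.asarray(truth, dtype=float)
    cand_anom = candidate - candidate.mean()
    truth_anom = truth - truth.mean()
    return float(np.sqrt(np.mean((cand_anom - truth_anom) ** 2)))


def linear_trend(series):
    """Ordinary least-squares slope per time step."""
    t = np.arange(len(series))
    slope, _intercept = np.polyfit(t, series, 1)
    return float(slope)


def ar1_noise(n, phi, sigma, rng):
    """First-order autoregressive noise: x[i] = phi * x[i-1] + eps[i]."""
    eps = rng.normal(0.0, sigma, n)
    x = np.empty(n)
    x[0] = eps[0]
    for i in range(1, n):
        x[i] = phi * x[i - 1] + eps[i]
    return x


# Hypothetical toy series (not the benchmark): a homogeneous AR(1) series,
# a "raw" copy with one artificial break, and an imperfectly corrected copy.
rng = np.random.default_rng(0)
n = 2000
homogeneous = ar1_noise(n, phi=0.3, sigma=1.0, rng=rng)

raw = homogeneous.copy()
raw[n // 2:] += 1.5          # assumed break of +1.5 at mid-series

corrected = raw.copy()
corrected[n // 2:] -= 1.4    # assumed correction that leaves a 0.1 residual

crmse_raw = crmse(raw, homogeneous)
crmse_corrected = crmse(corrected, homogeneous)
crmse_reduction = 100.0 * (1.0 - crmse_corrected / crmse_raw)

trend_bias_raw = linear_trend(raw) - linear_trend(homogeneous)
trend_bias_corrected = linear_trend(corrected) - linear_trend(homogeneous)
trend_bias_reduction = 100.0 * (1.0 - abs(trend_bias_corrected) / abs(trend_bias_raw))

print(f"CRMSE reduction: {crmse_reduction:.1f}%")
print(f"Trend bias reduction: {trend_bias_reduction:.1f}%")
```

In this toy setup the noise is identical in all three series, so both measures depend only on the size of the remaining step. Expressing the improvement as a percentage of the raw data error mirrors how the reductions are reported above.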