Summary
A time series is a sequence of data points in successive temporal order. Time series data is produced in many applications scenarios, and the techniques for its analysis have generated substantial interest. Time series join is a primitive operation that retrieves all pairs of correlated subsequences from two given time series. As the Pearson correlation coefficient, a measure of the correlation between two variables, has multiple beneficial mathematical properties, for example, the fact that it is invariant with respect to scale and offset, it is used to measure the correlation between two time series. Considering the need to analyze big time series data, we focus on the study of scalable and distributed techniques to process massive data sets. Specifically, we propose a parallel approach to perform time series joins using Spark, a popular analytics engine for large‐scale data processing. Our solution builds on (1) a fast method to compute the fast Fourier transform on the times series to calculate the correlation between two time series, (2) a lossless partition method to divide the time series into multiple subsequences and enable a parallel and correct computation of the join result, and (3) optimization techniques to avoid redundant computations. We performed extensive tests and showed that the proposed approach is efficient and scalable across different data sets and test configurations.