As an unsupervised learning technique, clustering can effectively capture the patterns in a data stream based on similarities among the data. Traditional data stream clustering algorithms depend heavily on prior knowledge or predefined parameters, even though the characteristics of real-time data are generally unknown. In addition, they rely on a user-specified threshold to suppress the effect of outliers and noise, and the choice of this threshold significantly affects clustering performance. Overlap among clusters is another major challenge for existing stream clustering methods. These constraints strongly limit their use in real-time applications. In this paper, a two-phase stream clustering algorithm based on fitness proportionate sharing is proposed. It handles streaming data when prior knowledge is not available by mapping the clustering problem into a multimodal optimization problem. It introduces a density-based objective function and adopts the fitness proportionate sharing strategy to search for the cluster centers more effectively. To capture the dynamic characteristics of streaming data, a recursive formula for the lower bound of the density function is derived, and a summary of historical data is maintained. The proposed algorithm is applied to different data sets and compared comprehensively with five well-known data stream clustering algorithms from the literature; the results show comparable or better performance. A scalability analysis of the proposed stream clustering method is also presented.
INDEX TERMS Data streams, clustering, unsupervised learning, data mining.
I. INTRODUCTION