Summary
With the rapid development of technologies such as the Internet, the volume and complexity of the data collected or generated in areas such as agriculture, biomedicine, and finance pose challenges to the scientific community. Analysis tools that extract useful information for decision support have therefore been receiving increasing attention, as researchers seek scalable alternatives to traditional algorithms. In this paper, we propose a scalable design and implementation of a particle swarm optimization classification (SCPSO) approach based on the Apache Spark framework. The main idea of the SCPSO algorithm is to find the optimal centroid for each target label using particle swarm optimization and then assign unlabeled data points to the closest centroid. Two variants, SCPSO-F1 and SCPSO-F2, were proposed based on different fitness functions and were tested on real data sets to evaluate their scalability and performance. The experimental results reveal that SCPSO-F1 and SCPSO-F2 scale very well with increasing data set sizes: the speedup of SCPSO-F2 is almost identical to linear speedup, while the speedup of SCPSO-F1 is very close to linear. Thus, SCPSO-F1 and SCPSO-F2 can be efficiently parallelized using the Apache Spark framework.
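The core idea above (search for one centroid per target label, then assign each unlabeled point to the nearest centroid) can be sketched as a minimal single-machine illustration. This is not the authors' Spark implementation: the function names, the PSO hyperparameters, and the sum-of-squared-distances fitness used here are our own assumptions standing in for the paper's SCPSO-F1/SCPSO-F2 fitness functions.

```python
import random

def pso_centroid(points, dim, n_particles=20, iters=100, seed=0):
    """Find a centroid for one class with a basic global-best PSO.

    Fitness here is the sum of squared distances to the class's points
    (an assumed stand-in; the paper defines two other fitness functions).
    """
    rng = random.Random(seed)

    def fitness(c):
        return sum(sum((p[d] - c[d]) ** 2 for d in range(dim)) for p in points)

    # Initialize particles uniformly inside the bounding box of the class data.
    lo = [min(p[d] for p in points) for d in range(dim)]
    hi = [max(p[d] for p in points) for d in range(dim)]
    pos = [[rng.uniform(lo[d], hi[d]) for d in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    w, c1, c2 = 0.7, 1.5, 1.5  # assumed inertia and acceleration coefficients
    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Standard PSO velocity update: inertia + cognitive + social terms.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                pos[i][d] += vel[i][d]
            f = fitness(pos[i])
            if f < pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f < gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return gbest

def classify(point, centroids):
    """Assign an unlabeled point to the label of the closest centroid."""
    return min(centroids, key=lambda lbl: sum(
        (point[d] - centroids[lbl][d]) ** 2 for d in range(len(point))))
```

In the Spark setting described in the paper, the expensive step is the fitness evaluation over the full data set, which is what gets distributed across the cluster; the swarm bookkeeping itself is cheap.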