This paper presents a systematic examination and experimental comparison of the prominent Federated Learning (FL) frameworks FedML, Flower, Substra, and OpenFL. The frameworks are evaluated by implementing federated learning over a varying number of clients, with an emphasis on scalability and key performance metrics. The study assesses the impact of increasing client counts on total training time, loss and accuracy values, and CPU and RAM usage. Results indicate distinct performance characteristics among the frameworks: Flower exhibits an unusually high loss, FedML achieves a comparatively low accuracy range of 66–79%, and Substra demonstrates good resource efficiency, albeit with exponential growth in total training time. Notably, OpenFL emerges as the most scalable platform, maintaining consistent accuracy, loss, and training time across different client counts, while its stable CPU and RAM usage underscores its reliability in real-world scenarios. This comprehensive analysis provides valuable insights into the relative performance of FL frameworks, offering a clearer understanding of their capabilities and guidance for their effective deployment across diverse user bases.