Federated learning (FL) is a distributed approach for training machine learning models without disclosing clients' private data to a central server. Nevertheless, FL training struggles to converge when clients have distinct data distributions, which increases training time and model prediction error. We propose ATHENA-FL, a federated learning system that accounts for clients with heterogeneous data distributions to generate accurate models in fewer training epochs than state-of-the-art approaches. ATHENA-FL also reduces communication costs, an additional benefit in resource-constrained scenarios. The system mitigates data heterogeneity by introducing a preliminary step before training that clusters clients with similar data distributions. To do so, each client locally trains a neural network whose weights serve as a probe of its data distribution. The proposed system then applies the one-versus-all strategy, training one binary detector for each class present in the cluster. Clients can thus compose more complex models by combining multiple detectors, which are shared with all participants through the system's database. We evaluate the clustering procedure using different layers of the neural network and verify that the last layer alone is sufficient to cluster the clients efficiently. The experiments show that using only the last layer as input to the clustering algorithm transmits 99.68% fewer bytes to generate the clusters than using all the neural network weights. Finally, our results show that ATHENA-FL correctly identifies samples, achieving up to 10.9% higher accuracy than traditional training. Furthermore, ATHENA-FL achieves lower training communication costs than the MobileNet architecture, reducing the number of transmitted bytes by 25% to 97% across the evaluated scenarios.
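
As a rough illustration of the two-step idea summarized above (clustering clients by the last-layer weights of a locally trained probe network, then training one binary one-versus-all detector per class within a cluster), the sketch below uses scikit-learn's KMeans together with PyTorch modules. The probe architecture, the number of clusters, and the helpers `last_layer_vector` and `predict` are illustrative assumptions, not ATHENA-FL's published implementation.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

def last_layer_vector(model: nn.Module) -> np.ndarray:
    """Flatten the weight matrix of the probe's last linear layer."""
    weight = list(model.parameters())[-2]  # [-1] would be the last layer's bias
    return weight.detach().cpu().numpy().ravel()

# Hypothetical probe: every client locally trains the same small architecture.
def make_probe(in_dim=32, hidden=16, out_dim=4) -> nn.Module:
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

# Step 1: cluster clients using only the last-layer weights as features,
# rather than the full set of model parameters.
probes = [make_probe() for _ in range(8)]  # stand-ins for locally trained probes
features = np.stack([last_layer_vector(p) for p in probes])
cluster_of_client = KMeans(n_clusters=2, n_init=10).fit_predict(features)

# Step 2 (one-versus-all): within a cluster, each class gets its own binary
# detector; a sample is assigned the class whose detector scores highest.
def predict(x: torch.Tensor, detectors: dict) -> int:
    with torch.no_grad():
        scores = {cls: det(x).item() for cls, det in detectors.items()}
    return max(scores, key=scores.get)

detectors = {cls: nn.Linear(32, 1) for cls in range(4)}  # untrained placeholders
print(predict(torch.randn(32), detectors))
```

Clustering on the last layer alone keeps the feature vectors small, which is consistent with the reported 99.68% reduction in bytes transmitted during the clustering step compared to sending all network weights.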