AI systems for decision‐making have become increasingly popular in several areas. However, biased decisions can be identified in many applications, which has become a concern for the computer science, artificial intelligence, and law communities. Therefore, researchers are proposing solutions to mitigate bias and discrimination in decision‐makers. Some explored strategies are based on GANs to generate fair data. Others are based on adversarial learning, achieving fairness by encoding fairness constraints through an adversarial model. Moreover, each proposal usually assesses its model with a specific metric, which makes comparing current approaches a complex task. Therefore, this work proposes a systematic benchmark procedure to assess fair machine learning models. The proposed procedure comprises a fairness‐utility trade‐off metric, the utility and fairness metrics that compose this assessment, the datasets and their preparation, and the statistical test. A previous work presents some of these definitions; the present work enriches the procedure by extending the applied datasets and the statistical guarantees used when comparing the models' results. We performed this benchmark evaluation for the non‐generative adversarial models, analyzing the literature models from the same metric perspective. This assessment could not indicate a single model that performs best across all datasets. However, we built an understanding of how each model performs on each dataset with statistical confidence.