One major challenge in multi-label classification is establishing suitable conditions for evaluating multi-label algorithms. Simplistic experimental setups based on artificial data may not capture situations that are crucial for analysing these algorithms. This article introduces an experimental framework for evaluating multi-label algorithms by artificially generating probabilistic label distributions. The proposed framework covers a wide variety of label distributions and enables users to simulate probabilistic label distributions with finer control over label dependence and problem difficulty. An experimental study conducted with the framework revealed new findings about five methods: binary relevance, classifier chain, dependent binary relevance, calibrated label ranking by pairwise comparison, and probabilistic classifier chain. The framework will facilitate new experimental studies analysing how changes in label dependence and problem difficulty affect the performance of new multi-label algorithms.
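
To make the idea of controllable label dependence and difficulty concrete, the sketch below shows one possible way to simulate multi-label data through a chain-rule factorization of the joint label distribution. The function `simulate_multilabel_data` and its `dependence` and `difficulty` parameters are illustrative assumptions for this sketch only, not the parametrization used by the proposed framework.

```python
import numpy as np

def simulate_multilabel_data(n_samples=1000, n_features=5, n_labels=4,
                             dependence=0.5, difficulty=0.1, seed=0):
    """Sample (X, Y) from a synthetic joint label distribution.

    `dependence` scales how strongly each label's conditional probability
    depends on the previously drawn labels (chain-rule factorization);
    `difficulty` in [0, 0.5) flattens the conditional probabilities toward
    0.5, raising the Bayes error. Both knobs are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    # Random linear scores mapping features to each label's logit.
    W = rng.normal(size=(n_features, n_labels))
    # Coefficients coupling each label to the labels drawn before it.
    C = dependence * rng.normal(size=(n_labels, n_labels))

    Y = np.zeros((n_samples, n_labels), dtype=int)
    for j in range(n_labels):
        # Conditional logit of label j given the features and labels 1..j-1.
        logits = X @ W[:, j] + Y[:, :j] @ C[:j, j]
        p = 1.0 / (1.0 + np.exp(-logits))
        # Map probabilities into [difficulty, 1 - difficulty] to control
        # how hard the problem is.
        p = (1 - 2 * difficulty) * p + difficulty
        Y[:, j] = rng.binomial(1, p)
    return X, Y

# Example: strongly dependent labels and a harder problem.
X, Y = simulate_multilabel_data(dependence=0.8, difficulty=0.2)
print(Y[:5])
```

With `dependence=0`, the labels become conditionally independent given the features, which is the setting where binary relevance is expected to suffice; increasing `dependence` or `difficulty` yields regimes where chain-based and pairwise methods can be compared under progressively harder conditions.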