“…One of the difficulties in comparing prior work is the use of different performance metrics. Some examples are accuracy (Altakrori et al, 2021; Stamatatos, 2018; Jafariakinabad and Hua, 2022; Fabien et al, 2020; Saedi and Dras, 2021; Zhang et al, 2018; Barlas and Stamatatos, 2020), F1 (Murauer and Specht, 2021), C@1 (Bagnall, 2015), recall (Lagutina, 2021), precision (Lagutina, 2021), macro-accuracy (Bischoff et al, 2020), AUC (Bagnall, 2015; Pratanwanich and Lio, 2014), R@8 (Rivera-Soto et al, 2021), and the unweighted average of F1, F0.5u, C@1, and AUC (Manolache et al, 2021; Kestemont et al, 2021; Tyo et al, 2021; Futrzynski, 2021; Peng et al, 2021; Bönninghoff et al, 2021; Boenninghoff et al, 2020; Embarcadero-Ruiz et al, 2022; Weerasinghe et al, 2021).…”
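
To illustrate why results reported under different metrics are hard to compare, the following minimal Python sketch scores the same set of verification outputs with accuracy, C@1, and F0.5u. It is not taken from the excerpt. It assumes the definitions commonly used in authorship verification shared tasks: each prediction is a score in [0, 1], a score of exactly 0.5 means the problem was left unanswered, C@1 rewards non-answers in proportion to the accuracy on the answered problems, and F0.5u counts non-answers as false negatives. The function names and the toy data are hypothetical.

```python
from typing import Sequence


def _is_correct(label: int, score: float) -> bool:
    """True when an answered problem (score != 0.5) matches its label."""
    return score != 0.5 and (score > 0.5) == (label == 1)


def accuracy(truth: Sequence[int], scores: Sequence[float]) -> float:
    """Plain accuracy; unanswered problems (score == 0.5) count as wrong."""
    return sum(_is_correct(t, s) for t, s in zip(truth, scores)) / len(truth)


def c_at_1(truth: Sequence[int], scores: Sequence[float]) -> float:
    """C@1 (assumed definition): (n_c + n_u * n_c / n) / n."""
    n = len(truth)
    n_unanswered = sum(1 for s in scores if s == 0.5)
    n_correct = sum(_is_correct(t, s) for t, s in zip(truth, scores))
    return (n_correct + n_unanswered * n_correct / n) / n


def f05_u(truth: Sequence[int], scores: Sequence[float]) -> float:
    """F0.5u (assumed definition): F0.5 with non-answers added to the false negatives."""
    tp = fp = fn = unanswered = 0
    for label, score in zip(truth, scores):
        if score == 0.5:
            unanswered += 1
        elif score > 0.5 and label == 1:
            tp += 1
        elif score > 0.5:
            fp += 1
        elif label == 1:
            fn += 1
    denominator = 1.25 * tp + 0.25 * (fn + unanswered) + fp
    return 1.25 * tp / denominator if denominator else 0.0


if __name__ == "__main__":
    # Toy data: 1 = same author, 0 = different authors; 0.5 = left unanswered.
    truth = [1, 1, 0, 0]
    scores = [0.9, 0.5, 0.2, 0.7]
    print("accuracy:", accuracy(truth, scores))
    print("C@1:     ", c_at_1(truth, scores))
    print("F0.5u:   ", f05_u(truth, scores))
```

On this toy input the three metrics give 0.5, 0.625, and 0.5 respectively, so the same system output can look better or worse depending solely on which metric a paper reports.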