“…It has been especially active since the recognition that decades-old metrics such as BLEU (Papineni et al., 2002) and ROUGE (Lin, 2004) are inadequate for evaluation (Mathur et al., 2020; Peyrard, 2019; Freitag et al., 2022). The focus in recent years has been on developing high-quality LLM-based metrics (Zhang et al., 2020; Zhao et al., 2019) that are, among other properties, explainable (Kaster et al., 2021; Leiter et al., 2022a, 2023a, 2022b; Sai et al., 2021), efficient (Kamal Eddine et al., 2022; Grünwald et al., 2022; Zouhar et al., 2023; Belouadi and Eger, 2023), robust (Chen and Eger, 2023; Rony et al., 2022), and reproducible (Chen et al., 2022; Grusky, 2023). The focus of Eval4NLP's Shared Task is on explainable, high-quality metrics induced by prompting the most recent classes of LLMs, including variants of LLaMA (Touvron et al., 2023).…”