“…For RQ1 and RQ2, we find that ChatGPT excels under human evaluation and semantic-aware automatic evaluation (see Figure 1 and Table 2). For RQ3, we find that the two evaluation methods may diverge when assessing machine-translated texts. This contrasts with the finding of Lu and Han (2023) that automated metrics can show moderate to strong correlations with human-assigned scores in assessing interpreting outputs, and the divergence is possibly due to the inherent differences between interpreting and translation. …ing (Brown et al., 2020; Chowdhery et al., 2022; Wei et al., 2022a,b; Wang et al., 2022), and several studies have explored the influence of prompting strategies on the translation performance of LLMs (Jiao et al., 2023; Hendy et al., 2023; Peng et al., 2023; Chen et al., 2023; He et al., 2023).…”