“…These primarily focus on natural-language-to-Python code generation: HumanEval (Chen et al., 2021), HumanEval+ (Liu et al., 2023b), APPS (Hendrycks et al., 2021), CodeContests, MBPP, and L2CEval (Ni et al., 2023). Variants of these benchmarks have been proposed to cover more languages (Wang et al., 2022a; Cassano et al., 2022; Athiwaratkun et al., 2022). Many benchmarks have focused on code generation involving APIs.…”