LLM Evaluation Frameworks. Current LLM evaluation frameworks primarily focus on measuring the competence of LLM-based agents in handling structured outputs. Existing evaluations predominantly rely on pre-formatted prompts to assess code completion [Wu et al., 2023, Zhang et al., 2024a, Yao et al., 2023]. While recent advances have produced autonomous agents specializing in intricate data science tasks, including analysis, visualization, and modeling [Qian et al., 2023, 2024], evaluations of these methods often depend on extensive human effort or use more powerful LLMs to assess the output [Dubois et al., 2023, Belyi et al., 2024].