Active Programming by Example with a Natural Language Prior

Zhong, Ruiqi; Snell, Charlie; Klein, Dan; Eisner, Jason

doi:10.48550/arxiv.2205.12422

Cited by 2 publications

(2 citation statements)

References 30 publications

“…Generally, a query that is equivalent (but not identical) to ground truth may be mistakenly classified as incorrect by automated evaluation metrics. Another study by Zhong et al (2022) identifies limitations within the Spider benchmark, such as issues with ties and certain syntactic problems. Their analysis is primarily focused on a subset of Spider, without quantifying the extent or impact of these limitations or conducting an assessment of other benchmarks.…”

Section: Related Work (mentioning)

confidence: 99%

Evaluating Cross-Domain Text-to-SQL Models and Benchmarks

Pourreza, Rafiei (2023)

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Text-to-SQL benchmarks play a crucial role in evaluating the progress made in the field and the ranking of different models. However, accurately matching a model-generated SQL query to a reference SQL query in a benchmark fails for various reasons, such as underspecified natural language queries, inherent assumptions in both model-generated and reference queries, and the non-deterministic nature of SQL output under certain conditions. In this paper, we conduct an extensive study of several prominent cross-domain text-to-SQL benchmarks and reevaluate some of the top-performing models within these benchmarks, by both manually evaluating the SQL queries and rewriting them in equivalent expressions. Our evaluation reveals that attaining a perfect performance on these benchmarks is unfeasible due to the multiple interpretations that can be derived from the provided samples. Furthermore, we find that the true performance of the models is underestimated and their relative performance changes after a re-evaluation. Most notably, our evaluation reveals a surprising discovery: a recent GPT4-based model surpasses the gold standard reference queries in the Spider benchmark in our human evaluation. This finding highlights the importance of interpreting benchmark evaluations cautiously, while also acknowledging the critical role of additional independent evaluations in driving advancements in the field.

“…Other important aspects studied include length generalization (Anil et al, 2022), compositional generalization (Shi et al, 2022), reverse engineering (Pearce et al, 2022), and generating development tools (Bareiß et al, 2022). The task of NL to Code is broadly of interest to the semantic parsing literature (Kamath and Das, 2018; Zhong et al, 2022;.…”

Section: Program Synthesis (mentioning)

confidence: 99%

InstructExcel: A Benchmark for Natural Language Instruction in Excel

Payan, Mishra, Singh et al. (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

With the evolution of Large Language Models (LLMs) we can solve increasingly more complex NLP tasks across various domains, including spreadsheets. This work investigates whether LLMs can generate code (Excel OfficeScripts, a TypeScript API for executing many tasks in Excel) that solves Excel-specific tasks provided via natural language user instructions. To do so we introduce a new large-scale benchmark, INSTRUCTEXCEL, created by leveraging the 'Automate' feature in Excel to automatically generate OfficeScripts from users' actions. Our benchmark includes over 10k samples covering 170+ Excel operations across 2,000 publicly available Excel spreadsheets. Experiments across various zero-shot and few-shot settings show that INSTRUCTEXCEL is a hard benchmark for state of the art models like GPT-4. We observe that (1) using GPT-4 over GPT-3.5, (2) providing more in-context examples, and (3) dynamic prompting can help improve performance on this benchmark.
