2021
DOI: 10.48550/arxiv.2105.09938
Preprint

Measuring Coding Challenge Competence With APPS

Abstract: While programming is one of the most broadly applicable skills in modern society, modern machine learning models still cannot code solutions to basic problems. Despite its importance, there has been surprisingly little work on evaluating code generation, and it can be difficult to accurately assess code generation performance rigorously. To meet this challenge, we introduce APPS, a benchmark for code generation. Unlike prior work in more restricted settings, our benchmark measures the ability of models to take…
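As the abstract describes, APPS-style benchmarks score a model by whether its generated programs behave correctly on held-out test cases. Purely as an illustration, here is a minimal Python sketch of that style of evaluation, assuming stdin/stdout test cases; the file name, helper, and timeout are hypothetical, and this is not the official APPS harness:

# Hypothetical sketch of test-case-based scoring (not the official APPS code).
import subprocess

def run_candidate(source_path, tests, timeout=4.0):
    """Return the fraction of stdin/stdout test cases the candidate passes."""
    passed = 0
    for stdin_text, expected in tests:
        try:
            result = subprocess.run(
                ["python", source_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            continue  # a timeout counts as a failed test
        if result.stdout.strip() == expected.strip():
            passed += 1
    return passed / len(tests) if tests else 0.0

# Usage with one made-up test case: input "2 3" should print "5".
# run_candidate("candidate.py", [("2 3\n", "5\n")])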

Cited by 26 publications (53 citation statements)
References 26 publications
“…The explainability categories we identified have varied technical feasibility with current techniques, and point to topics that are under-explored for generative AI. For example, for the Performance category, existing works have used the Computational Accuracy metric to evaluate generative code models [9,15,33,88], but not other metrics we uncovered regarding the characteristics of the generated artifacts and run-time efficiency. To understand performance differences and limitations with regard to different types of input, solutions have been explored for natural language generation under Prompt Engineering [57,58].…”
Section: Discussion 6.1 Informing XAI Approaches for GenAI for Code
confidence: 99%
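For concreteness, the Computational Accuracy this statement refers to is typically the share of problems for which a generated program produces the correct outputs on every test case. A hedged sketch of that aggregate, assuming per-problem pass fractions such as those produced by a runner like the one above (names are illustrative, not taken from [9,15,33,88]):

# Hypothetical aggregate metric over per-problem pass fractions.
def computational_accuracy(per_problem_pass_fractions):
    """Fraction of problems solved completely, i.e. all test cases pass."""
    solved = sum(1 for frac in per_problem_pass_fractions if frac == 1.0)
    return solved / len(per_problem_pass_fractions)

# As the statement observes, this captures functional correctness only;
# it says nothing about run-time efficiency or other properties of the
# generated artifacts.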
“…Large-scale natural language modeling has witnessed rapid advances since the inception of the Transformer architecture [46]. Recent works have shown that large language models (LLMs) pre-trained on large unstructured text corpora not only can perform strongly on various downstream NLP tasks [10,33,34,5] but the learned representations can also be used to model relations of entities [20], retrieve matching visual features [17], synthesize code from docstrings [13,7], solve math problems [8,39], and even serve as valuable priors when applied to diverse tasks from different modalities [23,45]. Notably, by pre-training on large-scale data, these models can also internalize an implicit knowledge base containing rich information about the world from which factual answers (e.g.…”
Section: Related Work
confidence: 99%
“…To bypass this limitation, Roziere et al. (2020) used unsupervised neural machine translation techniques to translate between languages using only monolingual corpora, and showed impressive results for translation between Java, C++, and Python. While Roziere et al. (2020) trained the model specifically for code translation, large language models, such as GPT-2 (Radford et al., 2019), GPT-3 (Brown et al., 2020), and Codex, have also been shown to have some competence in generating code (Hendrycks et al., 2021).…”
Section: Related Work
confidence: 99%