Pretrained programming language (PL) models, trained on large-scale code corpora, have shown considerable promise in automating software engineering tasks such as code completion, code translation, and program synthesis. However, current approaches mainly rely on supervised fine-tuning objectives borrowed from text generation, neglecting sequence-level properties of code, including but not limited to compilability as well as syntactic and functional correctness. To address this limitation, we propose PPOCoder, a new framework for code generation that combines pretrained PL models with Proximal Policy Optimization (PPO) deep reinforcement learning and incorporates execution feedback as an external source of knowledge during model optimization. PPOCoder is transferable across different code generation tasks and PLs. Extensive experiments on three code generation tasks demonstrate the effectiveness of our proposed approach compared to SOTA methods, improving compilation success rates and functional correctness across different PLs. Our code can be found at https://github.com/reddy-lab-code-research/PPOCoder.
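The abstract names two ingredients without detailing them: a reward derived from execution feedback and PPO-based policy optimization. The sketch below illustrates both under stated assumptions; the `compilation_reward` helper and the `gcc -fsyntax-only` check are illustrative stand-ins, not PPOCoder's actual reward design (which also scores syntactic and functional correctness), and the PPO loss shown is the standard clipped surrogate objective rather than the authors' exact formulation.

```python
import os
import subprocess
import tempfile

import torch


def compilation_reward(code: str, compiler_cmd=("gcc", "-fsyntax-only")) -> float:
    """Return +1.0 if the generated code passes a compiler check, -1.0 otherwise.

    A minimal stand-in for execution feedback: write the candidate program to a
    temporary file and ask the compiler to syntax-check it.
    """
    with tempfile.NamedTemporaryFile(suffix=".c", mode="w", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        result = subprocess.run(list(compiler_cmd) + [path], capture_output=True)
        return 1.0 if result.returncode == 0 else -1.0
    finally:
        os.unlink(path)


def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    """Standard PPO clipped surrogate objective (returned as a loss to minimize)."""
    # Probability ratio pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(new_logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Taking the elementwise minimum gives the pessimistic (clipped) bound.
    return -torch.min(unclipped, clipped).mean()
```

In a full RL fine-tuning loop, rewards like these would score sequences sampled from the pretrained PL model, and the clipped loss would drive the policy update while keeping it close to the sampling policy.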
Recent advances in machine learning have significantly improved the understanding of source code and achieved good performance on a number of downstream tasks. Open-source repositories like GitHub enable this progress by providing rich unlabeled code data. However, the lack of high-quality labeled data has largely hindered progress on several code-related tasks, such as program translation, summarization, synthesis, and code search. This paper introduces XLCoST (Cross-Lingual Code SnippeT dataset), a new benchmark dataset for cross-lingual code intelligence. Our dataset contains fine-grained parallel data in 8 languages (7 commonly used programming languages and English) and supports 10 cross-lingual code tasks. To the best of our knowledge, it is the largest parallel dataset for source code in terms of both size and number of languages. We also report the performance of several state-of-the-art baseline models on each task. We believe this new dataset can be a valuable asset for the research community and facilitate the development and validation of new methods for cross-lingual code intelligence.

1 Introduction

Recent advances in machine learning have benefited a number of code-related tasks, such as code translation, code summarization, and code synthesis. Open-source code repository websites like GitHub provide an enormous amount of source code data, which enables the training of large-scale programming language models such as CodeBERT (Feng et al., 2020), PLBART (Ahmad et al., 2021a), TransCoder (Roziere et al., 2020), and CodeT5 (Wang et al., 2021). These extensively pretrained models have shown superior performance on benchmark datasets like CodeXGLUE (Lu et al., 2021).

Although open-source code data is abundant in quantity, it has several disadvantages as training data for code-related models. First, most of the available code data is unlabeled. For tasks like code translation, code summarization, and code synthesis, high-quality parallel data is critical for model training, yet such data is difficult to mine from open-source projects. Second, labeled datasets are usually small. For example, the code translation data introduced in Zhu et al. (2022) has only around 70 programs for testing and 50 for validation; with evaluation data this small, models trained on the dataset may not be thoroughly evaluated. Moreover, the available labeled datasets usually cover only a limited number of languages. For example, the Code Translation dataset in CodeXGLUE covers only 2 languages, Java and C#. Because of the scarcity of labeled data in some programming languages, code tasks in low-resource languages remain unexplored.