Proceedings of the 15th International Conference on Mining Software Repositories 2018
DOI: 10.1145/3196398.3196408

Learning to Mine Aligned Code and Natural Language Pairs from Stack Overflow

Abstract: For tasks like code synthesis from natural language, code retrieval, and code summarization, data-driven models have shown great promise. However, creating these models requires parallel data between natural language (NL) and code with fine-grained alignments. Stack Overflow (SO) is a promising source for such a dataset: the questions are diverse, and most of them have corresponding answers with high-quality code snippets. However, existing heuristic methods (e.g., pairing the title of a post with the cod…
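As a concrete illustration of the heuristic pairing the abstract describes, here is a minimal sketch that couples a question title with each code block in the accepted answer's HTML body. The regex-based extraction, the function name, and the sample body are assumptions for illustration, not the paper's actual pipeline:

    # Minimal sketch of the heuristic baseline: pair a question's title with
    # each <pre><code> block in its accepted answer (HTML body as in the
    # Stack Exchange data dump). Illustrative only, not the paper's pipeline.
    import html
    import re

    CODE_BLOCK = re.compile(r"<pre><code>(.*?)</code></pre>", re.DOTALL)

    def title_code_pairs(title, accepted_answer_body):
        """Yield (NL, code) pairs from one question/accepted-answer pair."""
        for match in CODE_BLOCK.finditer(accepted_answer_body):
            snippet = html.unescape(match.group(1)).strip()
            if snippet:  # skip empty code blocks
                yield (title, snippet)

    body = "<p>Use a conditional:</p><pre><code>a = min(max(x, 1), 10)</code></pre>"
    print(list(title_code_pairs("How do I clamp a value in Python?", body)))

As the paper argues, such title/answer pairing is limited in both coverage and correctness, which motivates learning the alignment instead.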

Cited by 163 publications (138 citation statements) | References 42 publications

Citation statements (ordered by relevance):
“…Although various large-scale datasets to study code generation have been created from GitHub (Allamanis and Sutton, 2013, 2014; Allamanis et al., 2016), their development and test sets are randomly drawn from the same dataset, since human curation is prohibitively expensive. Similarly, Yin et al. (2018) collect a large dataset from Stackoverflow.com (CoNaLa) for training, but only manage to curate a small portion (∼2,900 examples) of single-line NL and code snippets for evaluation. We take advantage of nbgrader assignment notebooks to create an inexpensive, high-quality, human-curated test set of 3,725 NL statements with interactive history.…”
Section: Related Work
confidence: 99%
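For readers who want to inspect the curated pairs this excerpt mentions, a minimal loading sketch follows. The field names ("intent", "rewritten_intent", "snippet") match the released CoNaLa JSON files, but the file path is a placeholder:

    # Minimal sketch: read CoNaLa examples and print NL -> code pairs.
    # Field names follow the released corpus; the path is a placeholder.
    import json

    with open("conala-test.json") as f:  # placeholder path
        examples = json.load(f)

    for ex in examples[:3]:
        # Curators rewrote many intents; fall back to the raw intent otherwise.
        nl = ex.get("rewritten_intent") or ex["intent"]
        print(nl, "->", ex["snippet"])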
“…Existing tasks for mapping NL to source code primarily use a single NL utterance (Zettlemoyer and Collins, 2005; Iyer et al., 2017) to generate database queries (semantic parsing), single-line Python code (Yin et al., 2018; Oda et al., 2015), multi-line domain-specific code (Ling et al., 2016; Rabinovich et al., 2017), or sequences of API calls (Gu et al., 2016b). A recent task on the CONCODE dataset maps a single utterance to an entire method, conditioned on environment variables and methods.…”
Section: Introduction
confidence: 99%
“…It uses the Floating Parser architecture, a grammar-based approach that provides more flexibility without requiring the hand-engineering of lexicalized rules needed by synchronous CFG- or CCG-based semantic parsers [42]. This approach also produces more interpretable results and requires less training data than neural network approaches (e.g., [51, 52]). The parser parses user utterances into expressions in a simple functional DSL we created for PUMICE.…”
Section: Semantic Parsing
confidence: 99%
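To make the grammar-based idea concrete, here is a toy, rule-based parser in the same spirit: a small lexicon maps words to DSL primitives, and a single composition rule builds a functional expression. The lexicon and DSL are invented for illustration and are not PUMICE's actual grammar:

    # Toy rule-based semantic parser: lexical rules plus one composition
    # rule yield expressions in a tiny functional DSL (invented here).
    LEXICON = {"cappuccino": "Cappuccino", "espresso": "Espresso"}

    def parse(utterance):
        """Map an utterance to a DSL expression such as order(Cappuccino)."""
        tokens = utterance.lower().split()
        if "order" not in tokens:
            raise ValueError(f"no parse for {utterance!r}")
        items = [LEXICON[t] for t in tokens if t in LEXICON]
        if not items:
            raise ValueError(f"no known item in {utterance!r}")
        return f"order({items[0]})"

    print(parse("please order a cappuccino"))  # -> order(Cappuccino)

Because every output is built from explicit lexical rules, a parse can be traced back to the words that produced it, which is the interpretability advantage the excerpt contrasts with neural approaches.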
“…[Figure residue; recoverable examples: DJANGO y: self.max_entries = int(max_entries); CONALA x: "more pythonic alternative for getting a value in range not using min and max", y: a = 1 if x < 1 else 10 if x > 10 else x.] Figure 2: Sample natural language utterances and meaning representations from datasets used in this work: ATIS for dialogue management; DJANGO (Oda et al., 2015) and CONALA (Yin et al., 2018a) for code generation and summarization.…”
Section: ATIS
confidence: 99%
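The CONALA pair recovered above is runnable as-is; the conditional expression clamps x into [1, 10] without min()/max():

    # The recovered CONALA snippet, wrapped in a function and sanity-checked:
    # clamp x into [1, 10] with a conditional expression instead of min/max.
    def clamp(x):
        return 1 if x < 1 else 10 if x > 10 else x

    assert clamp(-5) == 1 and clamp(42) == 10 and clamp(7) == 7
    print(clamp(0.5), clamp(3), clamp(99))  # -> 1 3 10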
“…is reported for parser evaluation based on exact match, and BLEU-4 is adopted for generator evaluation. [Table residue: SNM (Yin and Neubig, 2017): 71.6; COARSE2FINE (Dong and Lapata, 2018): 74; (Hu et al., 2018): 65.9; 62.3.] For the code generation task in CONALA, we use BLEU-4 following the setup in Yin et al. (2018a).…”
Section: Experimental Setups
confidence: 99%
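As a sketch of the BLEU-4 setup this excerpt describes, the snippet below scores one hypothesis against one reference using NLTK's smoothed sentence-level BLEU. The whitespace tokenization is an assumption for brevity; the official CoNaLa evaluation applies its own code tokenizer:

    # Sketch of BLEU-4 scoring for generated code (NLTK, smoothed).
    # Whitespace tokenization is a simplification of CoNaLa's tokenizer.
    from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu

    reference = "a = 1 if x < 1 else 10 if x > 10 else x".split()
    hypothesis = "a = min(max(x, 1), 10)".split()

    score = sentence_bleu([reference], hypothesis,
                          weights=(0.25, 0.25, 0.25, 0.25),
                          smoothing_function=SmoothingFunction().method3)
    print(f"BLEU-4: {score:.3f}")

Smoothing matters here because short code snippets often have zero higher-order n-gram overlap, which would otherwise zero out the geometric mean.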