Statistical machine translation outperforms neural machine translation in software engineering: why and how

Phan, Hung; Jannesari, Ali

doi:10.1145/3416506.3423576

Cited by 9 publications

(8 citation statements)

References 37 publications

“…NMT models: Before the era of NMT, Statistical Machine Translation (SMT) [15] was the most popular technique for software engineering (SE) problems, it still outperforms NMT in some SE problems [71]. However, since we are interested in the specific problem of code generation, we focus on NMT that has shown superior performance on public benchmarks [9], and that it is widely recognized as the premier method for the translation of different languages [83].…”

Section: Threats To Validity (mentioning)

confidence: 99%

Can We Generate Shellcodes via Natural Language? An Empirical Study

Liguori, Al-Hossami, Cotroneo et al. 2022

Preprint

Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3,200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated using natural language. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors.

“…NMT models Before the era of NMT, Statistical Machine Translation (SMT) Costa-Jussá and Farrús (2014) was the most popular technique for software engineering (SE) problems, it still outperforms NMT in some SE problems (Phan and Jannesari 2020). However, since we are interested in the specific problem of code generation, we focus on NMT that has shown superior performance on public benchmarks (Bojar et al 2016), and that it is widely recognized as the premier method for the translation of different languages (Wu et al 2016).…”

Section: Threats To Validity (mentioning)

confidence: 99%

Can we generate shellcodes via natural language? An empirical study

et al. 2022

Writing software exploits is an important practice for offensive security analysts to investigate and prevent attacks. In particular, shellcodes are especially time-consuming and a technical challenge, as they are written in assembly language. In this work, we address the task of automatically generating shellcodes, starting purely from descriptions in natural language, by proposing an approach based on Neural Machine Translation (NMT). We then present an empirical study using a novel dataset (Shellcode_IA32), which consists of 3200 assembly code snippets of real Linux/x86 shellcodes from public databases, annotated using natural language. Moreover, we propose novel metrics to evaluate the accuracy of NMT at generating shellcodes. The empirical analysis shows that NMT can generate assembly code snippets from the natural language with high accuracy and that in many cases can generate entire shellcodes with no errors.

“…Li et al [180] conducted experiments on two datasets to demonstrate the effectiveness of their approach consisting of an attention mechanism and a pointer mixture network on code completion tasks. Phan and Jannesari [245] used three corpus for their experiments-a large-scale corpus of English-German translation in nlp [201], the Conala corpus [356], which contains Python software documentation as 116,000 English sentences, and the msr 2013 corpus [18]. Schuster et al [275] used a public archive of GitHub from 2020 [1].…”

Section: Data Collection (mentioning)

confidence: 99%

“…Gopalakrishnan et al [109] extracted relationships between topical concepts in the source code and the use of specific architectural developer tactics in that code. Phan and Jannesari [245] used machine translation to learn the mapping from prefixes to code tokens for code suggestion. They extracted the tokens from the documentation of the source code.…”

Section: Data Collection (mentioning)

confidence: 99%

A Survey on Machine Learning Techniques for Source Code Analysis

Sharma, Kechagia, Georgiou et al. 2021

Preprint

Context: The advancements in machine learning techniques have encouraged researchers to apply these techniques to a myriad of software engineering tasks that use source code analysis such as testing and vulnerabilities detection. A large number of studies poses challenges to the community to understand the current landscape. Objective: We aim to summarize the current knowledge in the area of applied machine learning for source code analysis. Method: We investigate studies belonging to twelve categories of software engineering tasks and corresponding machine learning techniques, tools, and datasets that have been applied to solve them. To do so, we carried out an extensive literature search and identified 364 primary studies published between 2002 and 2021. We summarize our observations and findings with the help of the identified studies. Results: Our findings suggest that the usage of machine learning techniques for source code analysis tasks is consistently increasing. We synthesize commonly used steps and the overall workflow for each task, and summarize the employed machine learning techniques. Additionally, we collate a comprehensive list of available datasets and tools useable in this context. Finally, we summarize the perceived challenges in this area that include availability of standard datasets, reproducibility and replicability, and hardware resources. CCS Concepts: • Software and its engineering → Software libraries and repositories; Software maintenance tools; Software post-development issues; Maintaining software; • Computing methodologies → Machine learning.
