Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.18

CoDesc: A Large Code–Description Parallel Dataset

Abstract: Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistent…

Cited by 11 publications (10 citation statements). References 23 publications.

“…To substantiate this quality, we fine-tune prominent CodeLLMs on tasks that necessitate the involvement of both code and text, including code summarization, code search, and code generation. [Table fragment from the citing paper: [Clement et al, 2020] 1, ≈ 7,700,000; CoDesc [Hasan et al, 2021] 1, 4,211,516; CodeSearchNet [Husain et al, 2019] 6, 2,326,976 / 4,125,470; CodeXGLUE CSN [Lu et al, 2021] 6, 1,005,474; Deepcom [Hu et al, 2020] 1, 424,028; CONCODE [Iyer et al, 2018b] …] We then compare these models, which have been fine-tuned on The Vault, with those fine-tuned on CSN. The comparison is made using the same test datasets and commonly employed metrics, such as BLEU, MRR, and pass@k.…”
Section: Empirical Evaluation
Mentioning confidence: 99%
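The statement above names BLEU, MRR, and pass@k as the evaluation metrics. As a point of reference, the short Python sketch below shows how MRR (code search) and the standard unbiased pass@k estimate (code generation) are typically computed; the function names and example numbers are illustrative assumptions, not taken from the cited papers.

import numpy as np

def mean_reciprocal_rank(ranks):
    # ranks[i] is the 1-based rank of the correct code snippet for query i.
    return float(np.mean([1.0 / r for r in ranks]))

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate given n generated samples, c of which pass
    # the reference tests; returns 1.0 when the failures cannot fill k slots.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative usage (numbers are made up):
print(mean_reciprocal_rank([1, 2, 5]))   # ~0.567
print(pass_at_k(n=200, c=12, k=10))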
“…Table 4 offers a comparison between The Vault and other parallel datasets frequently used for pretraining and fine-tuning downstream tasks. These datasets include Funcom, Deepcom [Hu et al, 2020], CONCODE [Iyer et al, 2018b], CSN [Husain et al, 2019], CoDesc [Hasan et al, 2021], and non-public data used for pretraining [Clement et al, 2020; Ciurumelea et al, 2020; Wang et al, 2021].…”
Section: Dataset Statistics
Mentioning confidence: 99%
“…The dataset FOL-codesc consists of pairs of natural language descriptions of Java code snippets and their first-order logic translations. We sampled pairs of natural language descriptions and their Java code snippets from the recently published CoDesc (Hasan et al, 2021) dataset, which consists of 4.2M data points. We cut off the natural language descriptions after the first sentence and translated them into an FOL formula with the candc-boxer tool chain.…”
Section: Natural Language and FOL Formula Pairs
Mentioning confidence: 99%
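The construction step described in this statement (keeping only the first sentence of each CoDesc description before FOL translation) can be sketched in a few lines of Python. The file name codesc_sample.json and the record fields "nl" and "code" are assumptions for illustration and may differ from the actual CoDesc release format; the FOL translation itself is performed by the external candc-boxer tool chain and is not reproduced here.

import json
import re

_FIRST_SENTENCE = re.compile(r"^(.*?[.!?])\s", re.S)

def first_sentence(text):
    # Return the description up to and including its first sentence terminator;
    # fall back to the whole (stripped) text if no terminator is found.
    match = _FIRST_SENTENCE.match(text.strip() + " ")
    return match.group(1) if match else text.strip()

def load_pairs(path):
    # Assumed format: a JSON list of {"nl": description, "code": java_method}.
    with open(path, encoding="utf-8") as handle:
        for record in json.load(handle):
            yield first_sentence(record["nl"]), record["code"]

if __name__ == "__main__":
    for description, code in load_pairs("codesc_sample.json"):
        print(description)  # this sentence would be passed to candc-boxer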