Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
DOI: 10.18653/v1/2021.findings-acl.18

CoDesc: A Large Code–Description Parallel Dataset

Abstract: Translation between natural language and source code can help software development by enabling developers to comprehend, ideate, search, and write computer programs in natural language. Despite growing interest from the industry and the research community, this task is often difficult due to the lack of large standard datasets suitable for training deep neural models, standard noise removal methods, and evaluation benchmarks. This leaves researchers to collect new small-scale datasets, resulting in inconsistent…

Cited by 11 publications (10 citation statements). References 23 publications.

“…To substantiate this quality, we fine-tune prominent CodeLLMs on tasks that necessitate the involvement of both code and text, including code summarization, code search, and code generation. [Table fragment from the citing paper: [Clement et al, 2020] 1, ≈ 7,700,000; CoDesc [Hasan et al, 2021] 1, 4,211,516; CodeSearchNet [Husain et al, 2019] 6, 2,326,976 / 4,125,470; CodeXGLUE CSN [Lu et al, 2021] 6, 1,005,474; Deepcom [Hu et al, 2020] 1, 424,028; CONCODE [Iyer et al, 2018b] …] We then compare these models, which have been fine-tuned on The Vault, with those fine-tuned on CSN. The comparison is made using the same test datasets and commonly employed metrics, such as BLEU, MRR, and pass@k.…”
Section: Empirical Evaluation
Mentioning confidence: 99%
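The statement above names BLEU, MRR, and pass@k as the evaluation metrics. As a point of reference, the short Python sketch below shows how MRR (code search) and the standard unbiased pass@k estimate (code generation) are typically computed; the function names and example numbers are illustrative assumptions, not taken from the cited papers.

import numpy as np

def mean_reciprocal_rank(ranks):
    # ranks[i] is the 1-based rank of the correct code snippet for query i.
    return float(np.mean([1.0 / r for r in ranks]))

def pass_at_k(n, c, k):
    # Unbiased pass@k estimate given n generated samples, c of which pass
    # the reference tests; returns 1.0 when the failures cannot fill k slots.
    if n - c < k:
        return 1.0
    return 1.0 - float(np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Illustrative usage (numbers are made up):
print(mean_reciprocal_rank([1, 2, 5]))   # ~0.567
print(pass_at_k(n=200, c=12, k=10))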
“…Table 4 offers a comparison between The Vault and other parallel datasets frequently used for pretraining and fine-tuning downstream tasks. These datasets include Funcom, Deepcom [Hu et al, 2020], CONCODE [Iyer et al, 2018b], CSN [Husain et al, 2019], CoDesc [Hasan et al, 2021], and non-public data used for pretraining [Clement et al, 2020; Ciurumelea et al, 2020; Wang et al, 2021].…”
Section: Dataset Statistics
Mentioning confidence: 99%
“…The dataset FOL-codesc consists of pairs of natural language descriptions of Java code snippets and their first-order logic translations. We sampled pairs of natural language descriptions and their Java code snippets from the recently published CoDesc (Hasan et al, 2021) dataset, which consists of 4.2M data points. We cut off the natural language descriptions after the first sentence and translated them into an FOL formula with the candc-boxer tool chain.…”
Section: Natural Language and FOL Formula Pairs
Mentioning confidence: 99%
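The construction step described in this statement (keeping only the first sentence of each CoDesc description before FOL translation) can be sketched in a few lines of Python. The file name codesc_sample.json and the record fields "nl" and "code" are assumptions for illustration and may differ from the actual CoDesc release format; the FOL translation itself is performed by the external candc-boxer tool chain and is not reproduced here.

import json
import re

_FIRST_SENTENCE = re.compile(r"^(.*?[.!?])\s", re.S)

def first_sentence(text):
    # Return the description up to and including its first sentence terminator;
    # fall back to the whole (stripped) text if no terminator is found.
    match = _FIRST_SENTENCE.match(text.strip() + " ")
    return match.group(1) if match else text.strip()

def load_pairs(path):
    # Assumed format: a JSON list of {"nl": description, "code": java_method}.
    with open(path, encoding="utf-8") as handle:
        for record in json.load(handle):
            yield first_sentence(record["nl"]), record["code"]

if __name__ == "__main__":
    for description, code in load_pairs("codesc_sample.json"):
        print(description)  # this sentence would be passed to candc-boxer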