Natural language generation (NLG) benchmarks provide an important avenue to measure progress and develop better NLG systems. Unfortunately, the lack of publicly available NLG benchmarks for low-resource languages poses a challenging barrier to building NLG systems that work well for languages with limited amounts of data. Here we introduce IndoNLG, the first benchmark to measure NLG progress in three low-resource yet widely spoken languages of Indonesia: Indonesian, Javanese, and Sundanese. Altogether, these languages are spoken by more than 100 million native speakers, and hence constitute an important use case for NLG systems today. Concretely, IndoNLG covers six tasks: summarization, question answering, chit-chat, and three different pairs of machine translation (MT) tasks. We collate a vast and clean pretraining corpus of Indonesian, Sundanese, and Javanese data, Indo4B-Plus, which we use to pretrain our models, IndoBART and IndoGPT. We show that IndoBART and IndoGPT achieve competitive performance on all tasks despite using only one-fifth the parameters of a larger multilingual model, mBART-large, while achieving up to a 4x speedup at inference time. This finding emphasizes the importance of pretraining on closely related, local languages to achieve more efficient learning and faster inference for very low-resource languages such as Javanese and Sundanese. Beyond the clean pretraining data, we publicly release all pretrained models and tasks at https://github.com/indobenchmark/indonlg to facilitate NLG research in these languages.
Although Indonesian is the fourth most frequently used language on the internet, NLP for Indonesian has not been studied intensively. One NLP application classified as an NLG task is automatic question generation (AQG). The task has generally been tackled with rule-based and cloze-test approaches, which work but depend heavily on the defined rules: while suitable for small-scale automated question generation systems, they become less efficient as the scale of the system grows. Many recent NLG architectures have shown significantly improved performance over earlier ones, such as generative pre-trained transformers (GPT), text-to-text transfer transformers (T5), and bidirectional and auto-regressive transformers (BART), among others. Previous studies on AQG in Indonesian were built on RNN-based architectures such as GRU and LSTM, as well as the Transformer. We compare the models from these previous studies with state-of-the-art models: the multilingual models mBART and mT5, and the monolingual models IndoBART and IndoGPT. As a result, the fine-tuned IndoBART performed significantly better than both BiGRU and BiLSTM on the SQuAD dataset. On the TyDiQA dataset, which is smaller than SQuAD, the fine-tuned IndoBART also performed better on most of the metrics. Povzetek (translated from Slovenian): For Indonesian, the fourth most frequently used language on the web, a text-to-question language transformer was developed for educational use.
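The model comparisons above rely on automatic overlap metrics. As a rough illustration of how a generated question is scored against a reference, here is a minimal sketch of unigram (BLEU-1) modified precision; this is an assumption about the evaluation style, not the papers' exact setup, which typically uses corpus-level BLEU with n-grams up to 4 and a brevity penalty (e.g. via sacreBLEU), alongside ROUGE.

```python
from collections import Counter

def bleu1_precision(candidate: str, reference: str) -> float:
    """Unigram (BLEU-1) modified precision between a generated
    question and a reference question: the fraction of candidate
    tokens that also appear in the reference, with counts clipped
    so repeated tokens cannot inflate the score."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    if not cand:
        return 0.0
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

# Toy Indonesian example in the spirit of the AQG evaluation above.
gen = "siapa presiden pertama indonesia"
ref = "siapa presiden pertama republik indonesia"
print(round(bleu1_precision(gen, ref), 2))  # → 1.0, every generated token appears in the reference
```

Full BLEU multiplies such precisions across n-gram orders and penalizes overly short candidates, which is why single-order precision alone can look deceptively high.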
At the center of the issues that halt Indonesian natural language processing (NLP) research, we find data scarcity. Resources in Indonesian languages, especially the local ones, are extremely scarce and underrepresented. Many Indonesian researchers refrain from publishing and/or releasing their datasets. Furthermore, the few public datasets that do exist are scattered across different platforms, which makes performing reproducible and data-centric research in Indonesian NLP even more arduous. Rising to this challenge, we initiate the first Indonesian NLP crowdsourcing effort, NusaCrowd. NusaCrowd strives to provide the largest datasheet aggregation with standardized data loading for NLP tasks in all Indonesian languages. By enabling open and centralized access to Indonesian NLP resources, we hope NusaCrowd can tackle the data scarcity problem hindering NLP progress in Indonesia and encourage NLP practitioners to collaborate.