Proceedings of the 28th International Conference on Computational Linguistics 2020
DOI: 10.18653/v1/2020.coling-main.66
IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Abstract: Although the Indonesian language is spoken by almost 200 million people and is the 10th most-spoken language in the world,1 it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the INDOLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release INDOBERT, a new pre-trained l…

Cited by 134 publications (104 citation statements)
References 42 publications
“…Even two versions of the IndoBERT model, [28] and [30], managed to classify IS queries with a 100% F1 score. Meanwhile, [29] achieved its highest F1 score of 98% when using a learning rate of 5e-5 and a batch size of 16. Thus, we employed this learning rate and batch size for subsequent evaluations.…”
Section: Results
confidence: 99%
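The selection procedure described in the statement above (keep the learning rate and batch size that produced the best F1 score, then reuse them for later evaluations) can be sketched as a small grid-search helper. This is a minimal illustration only: the F1 values below are placeholders, not the cited results, and `best_config` is a hypothetical helper name, not a function from any of the cited works.

```python
# Hypothetical sketch of hyperparameter selection: pick the
# (learning_rate, batch_size) pair with the highest F1 score.

def best_config(results):
    """Return the (learning_rate, batch_size) key with the highest F1 value."""
    return max(results, key=results.get)

# Illustrative grid of fine-tuning runs, mapping config -> F1.
# Only the winning pair (5e-5, 16) mirrors the cited statement;
# the other scores are made up for the example.
results = {
    (5e-5, 16): 0.98,
    (3e-5, 16): 0.95,
    (2e-5, 32): 0.93,
}

lr, bs = best_config(results)
print(lr, bs)  # → 5e-05 16
```

The chosen pair would then be fixed for all subsequent evaluation runs, as the cited study does.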
“…6. The models of [28,30] converged faster than [29]: at epochs 10 and 12, respectively, the accuracy of [28] and [30] reached 100%.…”
Section: Results
confidence: 99%
“…Also, these studies produced several datasets that can support the information extraction process, such as the datasets from the research of Gultom [7] and Syaifudin [14], which were labeled with tags for Named Entity Recognition (NER). Other research builds NLP corpora for Indonesian to support information extraction, namely Fajri Koto [15], which produced IndoLEM, an NLP dataset for Indonesian, and complemented it with a BERT model for Indonesian, IndoBERT.…”
Section: Recent Work
confidence: 99%
“…Word-level toxicity classification can be formulated as a sequence labeling task, which also actively uses the pre-trained models mentioned above. BERT comprises versatile information on words and their context, which allows it to be used successfully for sequence labeling tasks at different levels: part-of-speech tagging and syntactic parsing (Koto et al., 2020), named entity recognition (Hakala and Pyysalo, 2019), semantic role labeling (He et al., 2019), and detection of machine translation errors (Moura et al., 2020).…”
Section: Introduction
confidence: 99%