Indosum: A New Benchmark Dataset for Indonesian Text Summarization

Kurniawan, Kemal; Louvan, Samuel

doi:10.1109/ialp.2018.8629109

Cited by 40 publications

(38 citation statements)

References 21 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…Because the sentiment analysis data is low-resource and imbalanced, we use stratified 5-fold cross-validation, and evaluate based on F1 score. For summarization, on the other hand, we use the canonical splits provided by Kurniawan and Louvan (2018), and evaluate the resulting summary with ROUGE (F1) (Lin, 2004) in the form of three different metrics: R1, R2, and RL.…”

Section: Evaluation Methodologymentioning

confidence: 99%

“…Aristoteles et al (2012) deployed a genetic algorithm over a 200-document summarization dataset, and Gunawan et al (2017) performed unsupervised summarization over 3,075 news articles. As an attempt to create a standardized corpus, Koto (2016) released a 300-document chat summarization dataset, and Kurniawan and Louvan (2018) released the IndoSum 19K document-summary dataset. At the time we carried out this work, 20 IndoSum was the largest Indonesian summarization corpus in the news domain, manually constructed from CNN Indonesia 21 and Kumparan 22 documents.…”

Section: Semantic Tasksmentioning

confidence: 99%

“…IndoSum is a single-document summarization dataset where each article has one abstractive summary. Kurniawan and Louvan (2018) released IndoSum together with the ORACLE -a set of extractive summaries generated automatically by maximizing ROUGE score between sentences of the article and its abstractive summary. We include IndoSum as the summarization dataset in INDOLEM, and evaluate the performance of extractive summarization in this paper.…”

Section: Semantic Tasksmentioning

confidence: 99%

“…For extractive summarization baselines, we use the models of Kurniawan and Louvan (2018) and Cheng and Lapata (2016) as baselines. Kurniawan and Louvan (2018) propose a sentence tagging approach based on a hidden Markov model, while Cheng and Lapata (2016) Table 2: Comparison of baselines and BERT-based models for all INDOLEM tasks. All listed models were implemented and run by the authors, except for those marked with " †" where the results are sourced from the original paper.…”

Section: Baselinesmentioning

confidence: 99%

See 3 more Smart Citations

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Koto¹,

Rahimi²,

Lau³

et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

135

View full text Add to dashboard Cite

Although the Indonesian language is spoken by almost 200 million people and the 10th mostspoken language in the world, 1 it is under-represented in NLP research. Previous work on Indonesian has been hampered by a lack of annotated datasets, a sparsity of language resources, and a lack of resource standardization. In this work, we release the INDOLEM dataset comprising seven tasks for the Indonesian language, spanning morpho-syntax, semantics, and discourse. We additionally release INDOBERT, a new pre-trained language model for Indonesian, and evaluate it over INDOLEM, in addition to benchmarking it against existing resources. Our experiments show that INDOBERT achieves state-of-the-art performance over most of the tasks in INDOLEM.

show abstract

Section: Evaluation Methodologymentioning

confidence: 99%

Section: Semantic Tasksmentioning

confidence: 99%

Section: Semantic Tasksmentioning

confidence: 99%

Section: Baselinesmentioning

confidence: 99%

See 2 more Smart Citations

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Koto¹,

Rahimi²,

Lau³

et al. 2020

Proceedings of the 28th International Conference on Computational Linguistics

135

View full text Add to dashboard Cite

show abstract

“…Further improvements are introduced to the baseline model by using the pointer generator network and coverage mechanism using reinforcement learning based training procedure (See et al, 2017;Paulus et al, 2017). There is an inherent limitation to natural language processing tasks such as text summarization for resource-poor and morphological complex languages owing to a shortage of quality linguistic data available (Kurniawan and Louvan, 2018). The use of synthetic data along with the real data is one of the popular approaches followed in machine translation domain for the low resource conditions to improve the translation quality (Bojar and Tamchyna, 2011;Hoang et al, 2018;Chinea-Rıos et al, 2017).…”

Section: Introductionmentioning

confidence: 99%

Abstract Text Summarization: A Low Resource Challenge

Parida¹,

Motlíček²

2019

Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conferen

View full text Add to dashboard Cite

Text summarization is considered as a challenging task in the NLP community. The availability of datasets for the task of multilingual text summarization is rare, and such datasets are difficult to construct. In this work, we build an abstract text summarizer for the German language text using the state-of-the-art "Transformer" model. We propose an iterative data augmentation approach which uses synthetic data along with the real summarization data for the German language. To generate synthetic data, the Common Crawl (German) dataset is exploited, which covers different domains. The synthetic data is effective for the low resource condition and is particularly helpful for our multilingual scenario where availability of summarizing data is still a challenging issue. The data are also useful in deep learning scenarios where the neural models require a large amount of training data for utilization of its capacity. The obtained summarization performance is measured in terms of ROUGE and BLEU score. We achieve an absolute improvement of +1.5 and +16.0 in ROUGE1 F1 (R1 F1) on the development and test sets, respectively, compared to the system which does not rely on data augmentation.

show abstract

CLTS: A New Chinese Long Text Summarization Dataset

Liu

Zhang

Chen

et al. 2020

Lecture Notes in Computer Science

View full text Add to dashboard Cite

The abstractive methods lack of creative ability is particularly a problem in automatic text summarization. The summaries generated by models are mostly extracted from the source articles. One of the main causes for this problem is the lack of dataset with abstractiveness, especially for Chinese. In order to solve this problem, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct errors of factual inconsistencies, and propose the first Chinese Long Text Summarization dataset with a high level of abstractiveness, CLTS+, which contains more than 180K article-summary pairs and is available online 1 . Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against other datasets to quantify the abstractiveness and difficulty of our new data and train several baselines on CLTS+ to verify the utility of it for improving the creative ability of models.

show abstract

Indosum: A New Benchmark Dataset for Indonesian Text Summarization

Cited by 40 publications

References 21 publications

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

IndoLEM and IndoBERT: A Benchmark Dataset and Pre-trained Language Model for Indonesian NLP

Abstract Text Summarization: A Low Resource Challenge

CLTS: A New Chinese Long Text Summarization Dataset

Contact Info

Product

Resources

About