2018 International Conference on Asian Language Processing (IALP) 2018
DOI: 10.1109/ialp.2018.8629109

Indosum: A New Benchmark Dataset for Indonesian Text Summarization

Abstract: Automatic text summarization is generally considered a challenging task in the NLP community. One of the challenges is that large, publicly available datasets are relatively rare and difficult to construct. The problem is even worse for low-resource languages such as Indonesian. In this paper, we present INDOSUM, a new benchmark dataset for Indonesian text summarization. The dataset consists of news articles and manually constructed summaries. Notably, the dataset is almost 200x larger than the previous…

citations
Cited by 40 publications
(38 citation statements)
references
References 21 publications
“…Because the sentiment analysis data is low-resource and imbalanced, we use stratified 5-fold cross-validation, and evaluate based on F1 score. For summarization, on the other hand, we use the canonical splits provided by Kurniawan and Louvan (2018), and evaluate the resulting summary with ROUGE (F1) (Lin, 2004) in the form of three different metrics: R1, R2, and RL.…”
Section: Evaluation Methodology (mentioning)
confidence: 99%
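
The three metrics named in the statement above (R1, R2, RL) can be illustrated with a minimal sketch. This is not the scoring code used by the citing paper: the function names are hypothetical, tokenization is plain lowercased whitespace splitting, and RL is computed over the whole text as a single sequence. Standard ROUGE tooling typically adds stemming and more careful tokenization, so its scores may differ.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a multiset (Counter) of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(matches, cand_total, ref_total):
    """Harmonic mean of precision and recall; 0.0 when nothing matches."""
    if matches == 0:
        return 0.0
    precision = matches / cand_total
    recall = matches / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n=1):
    """ROUGE-N F1 using clipped n-gram overlap (illustrative only)."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most min(cand, ref) times
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """ROUGE-L F1 treating each text as one token sequence (illustrative only)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    return f1(lcs_length(cand, ref), len(cand), len(ref))


if __name__ == "__main__":
    summary = "the cat sat"
    gold = "the cat sat on the mat"
    print(rouge_n_f1(summary, gold, n=1))
    print(rouge_n_f1(summary, gold, n=2))
    print(rouge_l_f1(summary, gold))
```

Reporting the F1 variant balances precision against recall, so a summary is not rewarded merely for being very short or very long.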
“…Aristoteles et al. (2012) deployed a genetic algorithm over a 200-document summarization dataset, and Gunawan et al. (2017) performed unsupervised summarization over 3,075 news articles. As an attempt to create a standardized corpus, Koto (2016) released a 300-document chat summarization dataset, and Kurniawan and Louvan (2018) released the IndoSum 19K document-summary dataset. At the time we carried out this work, IndoSum was the largest Indonesian summarization corpus in the news domain, manually constructed from CNN Indonesia and Kumparan documents.…”
Section: Semantic Tasks (mentioning)
confidence: 99%
“…Further improvements are introduced to the baseline model by using the pointer generator network and coverage mechanism using reinforcement learning based training procedure (See et al., 2017; Paulus et al., 2017). There is an inherent limitation to natural language processing tasks such as text summarization for resource-poor and morphological complex languages owing to a shortage of quality linguistic data available (Kurniawan and Louvan, 2018). The use of synthetic data along with the real data is one of the popular approaches followed in machine translation domain for the low resource conditions to improve the translation quality (Bojar and Tamchyna, 2011; Hoang et al., 2018; Chinea-Ríos et al., 2017).…”
Section: Introduction (mentioning)
confidence: 99%