PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Doostmohammadi, Ehsan; Bokaei, Mohammad Hadi; Sameti, Hossein

doi:10.1109/istel.2018.8661095

Cited by 4 publications

(3 citation statements)

References 19 publications

“…PerKey (Doostmohammadi et al, 2018) is a key phrase extraction dataset for the Persian language crawled from six Persian news agencies. There are 553k articles available in this dataset.…”

Section: Downstream Datasets (mentioning)

confidence: 99%

ARMAN: Pre-training with Semantically Selecting and Reordering of Sentences for Persian Abstractive Summarization

Salemi, Kebriaei, Minaei et al. 2021

Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Abstractive text summarization is one of the areas influenced by the emergence of pre-trained language models. Current pre-training works in abstractive summarization give more points to the summaries with more words in common with the main text and pay less attention to the semantic similarity between generated sentences and the original document. We propose ARMAN, a Transformer-based encoder-decoder model pre-trained with three novel objectives to address this issue. In ARMAN, salient sentences from a document are selected according to a modified semantic score to be masked and form a pseudo summary. To summarize more accurately and in a way closer to human writing patterns, we applied modified sentence reordering. We evaluated our proposed models on six downstream Persian summarization tasks. Experimental results show that our proposed model achieves state-of-the-art performance on all six summarization tasks measured by ROUGE and BERTScore. Our models also outperform prior works in textual entailment, question paraphrasing, and multiple choice question answering. Finally, we conducted a human evaluation and show that using the semantic score significantly improves summarization results.

“…We used KEA [36] as our supervised base-line method. For more information on the hyperparameters, settings and implementation of the base-line models see [29].…”

Section: Baseline Models (mentioning)

confidence: 99%

“…Here, we use a subset of the PerKey dataset introduced in [29] with at least 3 keyphrases for each news article. As concluded in PerKey paper, news articles with at least 3 keyphrases are more reliable in terms of recall.…”

Section: A Training and Testing Datasets (mentioning)

confidence: 99%

Persian Keyphrase Generation Using Sequence-to-Sequence Models

Doostmohammadi, Bokaei, Sameti, 2019

2019 27th Iranian Conference on Electrical Engineering (ICEE)

Self Cite

Keyphrases are a very short summary of an input text and provide the main subjects discussed in the text. Keyphrase extraction is a useful upstream task and can be used in various natural language processing problems, for example, text summarization and information retrieval. However, not all the keyphrases are explicitly mentioned in the body of the text. In real-world examples there are always some topics that are discussed implicitly. Extracting such keyphrases requires a generative approach, which is adopted here. In this paper, we try to tackle the problem of keyphrase generation and extraction from news articles using deep sequence-to-sequence models. These models significantly outperform the conventional methods such as Topic Rank, KPMiner, and KEA in the task of keyphrase extraction.

PerKey: A Persian News Corpus for Keyphrase Extraction and Generation

Doostmohammadi, Bokaei, Sameti, 2018

2018 9th International Symposium on Telecommunications (IST)

Keyphrases provide an extremely dense summary of a text. Such information can be used in many Natural Language Processing tasks, such as information retrieval and text summarization. Since previous studies on Persian keyword or keyphrase extraction have not published their data, the field suffers from the lack of a human-extracted keyphrase dataset. In this paper, we introduce PerKey, a corpus of 553k news articles from six Persian news websites and agencies with relatively high-quality author-extracted keyphrases, which is then filtered and cleaned to achieve higher-quality keyphrases. The resulting data was put through human assessment to ensure the quality of the keyphrases. We also measured the performance of different supervised and unsupervised techniques, e.g. TF-IDF, MultipartiteRank, and KEA, on the dataset using precision, recall, and F1-score.
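The abstract names precision, recall, and F1-score as the evaluation metrics but does not say how a predicted keyphrase is matched against a reference one. As a rough illustration only, the sketch below assumes exact matching after whitespace normalization and case folding; the function names `normalize` and `prf1` are hypothetical and are not taken from the PerKey paper or its code.

```python
from typing import Iterable, Tuple


def normalize(phrase: str) -> str:
    # Assumed normalization (not PerKey's): collapse whitespace and casefold.
    return " ".join(phrase.split()).casefold()


def prf1(predicted: Iterable[str], gold: Iterable[str]) -> Tuple[float, float, float]:
    """Exact-match precision, recall, and F1 for one document's keyphrases."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(g) for g in gold}
    # Ignore phrases that become empty after normalization.
    pred.discard("")
    ref.discard("")

    matched = len(pred & ref)
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    p, r, f = prf1(
        ["news corpus", "keyphrase  extraction", "Persian"],
        ["keyphrase extraction", "persian", "dataset", "summarization"],
    )
    print(f"P={p:.3f} R={r:.3f} F1={f:.3f}")
```

In this toy call two of the three predicted phrases match two of the four reference phrases, so precision is 2/3 and recall is 1/2. A corpus-level score would additionally need an averaging choice (per-document macro average or pooled micro average), which the abstract does not specify.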
