2022
DOI: 10.1007/978-3-031-17258-8_27

Textual Paraphrase Dataset for Deep Language Modelling

Abstract: The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish paraphrase pairs. During the corpus creation, we strived to gather challenging paraphrase pairs, more suitable to test the capabilities of natural language understanding models. The paraphrases are both selected and classified manually, so as to minimise lexical overlap, and provide examples that are structurally and lexically different to the maximum extent. An important distinguishing feature of the corpus is that most of the paraphrase pairs a…

Cited by 4 publications (5 citation statements)
References 2 publications
“…The Turku Paraphrase Corpus introduced in this paper, the first large-scale, manually annotated paraphrase corpus for Finnish, includes 91,604 manually extracted and labeled paraphrases with an additional 13,041 human-made rephrasing of statements. While the first incomplete version of the corpus was released in Kanerva et al (2021b), the current work extends the contributions into multiple directions: (1) the corpus size is doubled from the first release, (2) the text sources used to gather the paraphrases are extended from alternative subtitles and news headings to include also news articles, university student essays, translation exercises made by university students, as well as messages from an online discussion forum, (3) each manually extracted paraphrase is distributed together with the original document context to allow studies on paraphrasing in context, (4) in addition to manually extracted and labeled paraphrases, an automatically extracted subset of the corpus that contains related nonparaphrase segments is provided to support paraphrase classification.…”
Section: Resources For Finnish (mentioning)
confidence: 99%
“…Our paraphrase classification model is a pairwise classifier based on the BERT encoder, following our initial work reported in Kanerva et al (2021b)…”
Section: Paraphrase Classifier (mentioning)
confidence: 99%
“…For the generative task, we use a paraphrase dataset from (Kanerva et al, 2021). We use 21k for training, 2.6k for validation, and 2.6k for testing.…”
Section: Downstream Tasks (mentioning)
confidence: 99%
“…Finnish and Swedish The Finnish Paraphrase Corpus (Kanerva et al, 2021), a dataset of manually selected subtitle lines and news headlines, annotated on a four-class ordinal scale. A small test set in Swedish is likewise available.…”
Section: Datasets (mentioning)
confidence: 99%