The Turku Paraphrase Corpus is a dataset of over 100,000 Finnish paraphrase pairs. During corpus creation, we strove to gather challenging paraphrase pairs that are better suited to testing the capabilities of natural language understanding models. The paraphrases are both selected and classified manually so as to minimise lexical overlap and to provide examples that are as structurally and lexically different as possible. An important distinguishing feature of the corpus is that most of the paraphrase pairs are extracted and distributed in their native document context rather than in isolation. The primary application of the dataset is the development and evaluation of deep language models, and representation learning in general.
This document describes the annotation guidelines used to construct the Turku Paraphrase Corpus. The guidelines were developed alongside the corpus annotation and were revised and extended regularly during the annotation work. Our paraphrase annotation scheme uses the base scale 1-4, where labels 1 and 2 are used for negative candidates (not paraphrases), while labels 3 and 4 mark paraphrases at least in the given context, if not everywhere. In addition to the base labels, the scheme is enriched with subcategories (flags) that distinguish different types of paraphrases within the two positive labels, making the annotation scheme suitable for more fine-grained paraphrase categorization. The annotation scheme was used to annotate over 100,000 Finnish paraphrase pairs.

[...] 'I'll be there in half an hour'
Saavun 30 minuutin kuluessa 'I will arrive in 30 minutes'
→4

Voitinko minä? 'Did I win?'
Olenko voittaja? 'Am I the winner?'
→4

Tykkään kasvojesi painaumista. 'I like the dents on your face.'
Pidän noista kuopista kasvoissasi 'I enjoy those impressions on your face.'
→4

• A merely syntactic difference, e.g. a missing discourse connective, is not taken into account.

Mutta kolme kertaa? Silloin karma yrittää kertoa jotakin. 'But three times? Then karma is trying to tell you something.'
Kolme kertaa tarkoittaa, että karma yrittää kertoa jotain. 'Three times means that karma is trying to tell you something.'
→4

• Absurd interpretations of an utterance are not considered. Idioms and phrasal expressions can also have a different literal meaning, but if the literal meaning is absurd it is not taken into account. Likewise, clear metaphors and similes receive label 4 when the literal meaning is absurd. Usually the word "kuin" indicates a simile.

Tämä aika vuodesta on pahasta hiuksilleni. 'This time of year is bad for my hair.'
Hiukseni eivät pidä tästä vuodenajasta. 'My hair does not like this time of year.'
→4

voin heittää sinut kotiin 'I can give you a ride home'
pääset minun kyydissäni kotiin 'you can catch a lift home with me'
→4

Työskentelee kuin ahkera mehiläinen 'He/she is working busy as a bee'
Ahertaa kuin muurahainen 'Toiling away like an ant'
→4

• A difference in number, tense or similar is not taken into consideration in metaphors or similes, since it is part of the expression and does not affect the meaning. The same applies to plurale tantum expressions.

Se matkaa tähtien välillä kuin aurinkotuulet 'It is traveling between the stars like a solar wind'
Se leviää tähtien välissä kuin kulovalkea 'It is catching on among the stars like wildfire'
→4

Häät 'Wedding'
Hääjuhla 'Wedding party'
→4

Olen vain idiootti, joka höpöttää puhelinvastaajaan 'I'm just a fool who is blabbing on an answering machine'
Olen niitä idiootteja jotka höpöttävät vastaajaan 'I'm one of those idiots who blab on answering machines'
→4

[...] 'I have got an idea'
Minulla on idea. 'I have a thought.'
→4

Compare to:

Parkkeerasimme tähän 'We parked here'
Parkkeeraamme tähän 'We will be parking here'
→4i

• Proper nouns can be considered the same as their common noun descriptions when there is...
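The base scale and flags described in the guidelines above can be illustrated with a small label parser. This is only a sketch under stated assumptions: it assumes a label is written as one base digit (1-4) followed by zero or more single-character flags, as in the "4i" example above, and that flags occur only on the positive labels 3 and 4. The names `ParaphraseLabel` and `parse_label` are hypothetical and are not part of the corpus tooling, and the meaning of individual flags is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParaphraseLabel:
    """Hypothetical container for one annotation label such as "4" or "4i"."""

    base: int         # base label on the 1-4 scale
    flags: frozenset  # trailing flag characters, e.g. {"i"} for "4i"

    @property
    def is_paraphrase(self) -> bool:
        # Labels 3 and 4 are positive; labels 1 and 2 are negative.
        return self.base >= 3


def parse_label(raw: str) -> ParaphraseLabel:
    """Split a raw label string into its base label and flags.

    Assumption: flags are single characters appended to the base digit,
    and they are only allowed on the positive labels 3 and 4.
    """
    text = raw.strip()
    if not text or text[0] not in "1234":
        raise ValueError(f"label must start with a digit 1-4: {raw!r}")
    base = int(text[0])
    flags = frozenset(text[1:])
    if flags and base < 3:
        raise ValueError(f"flags are only expected on labels 3 and 4: {raw!r}")
    return ParaphraseLabel(base=base, flags=flags)


if __name__ == "__main__":
    for raw in ["4", "4i", "2"]:
        label = parse_label(raw)
        print(raw, label.base, sorted(label.flags), label.is_paraphrase)
```

Keeping the base label and the flags separate makes it easy to collapse the annotation to a binary paraphrase decision while still retaining the finer-grained subcategories.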
Earlier studies have shown that children are efficient second language learners. Research has also shown that musical background might affect second language learning. A two-day auditory training paradigm was used to investigate whether studying in a music-oriented education program affects children’s sensitivity to acquire a non-native vowel contrast. Training effects were measured with listen-and-repeat production tests. Two groups of monolingual Finnish children (9–11 years, N=23) attending music-oriented and regular fourth grades were tested. The stimuli were two semisynthetic pseudo words /ty:ti/ and /tʉ:ti/ with the native vowel /y/ and the non-native vowel /ʉ/ embedded. Both groups changed their pronunciation after the first training. The change was reflected in the second formant values of /ʉ/, which lowered significantly after three trainings. The results show that 9–11-year-old children benefit from passive auditory training in second language production learning regardless of whether or not they attend a music-oriented education program.
In this paper, we study natural language paraphrasing from both corpus creation and modeling points of view. We focus in particular on a methodology that allows the extraction of challenging examples of paraphrase pairs in their natural textual context, leading to a dataset potentially more suitable for evaluating the models' ability to represent meaning, especially in document context, when compared with those gathered using various sentence-level heuristics. To this end, we introduce the Turku Paraphrase Corpus, the first large-scale, fully manually annotated corpus of paraphrases in Finnish. The corpus contains 104,645 manually labeled paraphrase pairs, of which 98% are verified to be true paraphrases, either universally or within their present context. In order to control the diversity of the paraphrase pairs and avoid certain biases easily introduced in automatic candidate extraction, the paraphrases are manually collected from different paraphrase-rich text sources. This allows us to create a challenging dataset including longer and more lexically diverse paraphrases than can be expected from those collected through heuristics. In addition to quality, this also allows us to keep the original document context for each pair, making it possible to study paraphrasing in context. To our knowledge, this is the first paraphrase corpus which provides the original document context for the annotated pairs. We also study several paraphrase models trained and evaluated on the new data. Our initial paraphrase classification experiments indicate the challenging nature of the dataset when classifying with the detailed labeling scheme used in the corpus annotation, with accuracy lagging substantially behind human performance. However, when evaluating the models on a large-scale paraphrase retrieval task over almost 400M candidate sentences, the results are highly encouraging, with 29–53% of the pairs ranked in the top 10, depending on the paraphrase type.
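The top-10 figure reported for the retrieval task can be read as a recall-at-k style measure: the share of pairs whose true paraphrase is ranked among the first k candidates. The sketch below shows one way such a number could be computed. It is an illustration only, not the evaluation code used in the paper; the function names `rank_of_gold` and `recall_at_k` and the use of plain similarity scores are assumptions.

```python
from typing import Sequence


def rank_of_gold(gold_score: float, candidate_scores: Sequence[float]) -> int:
    """Return the 1-based rank of the true paraphrase among the candidates.

    Assumption: higher scores mean more similar, and ties are resolved in
    favour of the true paraphrase (only strictly higher scores outrank it).
    """
    return 1 + sum(1 for score in candidate_scores if score > gold_score)


def recall_at_k(ranks: Sequence[int], k: int = 10) -> float:
    """Fraction of queries whose true paraphrase is ranked within the top k."""
    if not ranks:
        raise ValueError("ranks must not be empty")
    hits = sum(1 for rank in ranks if rank <= k)
    return hits / len(ranks)


if __name__ == "__main__":
    # Toy ranks for four queries; two of them fall within the top 10.
    print(recall_at_k([1, 5, 11, 200], k=10))
```

With a candidate pool of hundreds of millions of sentences, the ranks would in practice come from an approximate nearest-neighbour search rather than from exhaustive scoring, but the metric itself stays the same.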
The Turku Paraphrase Corpus is available at github.com/TurkuNLP/Turku-paraphrase-corpus as well as through the popular HuggingFace datasets under the CC-BY-SA license.