Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) 2018
DOI: 10.18653/v1/p18-1042
ParaNMT-50M: Pushing the Limits of Paraphrastic Sentence Embeddings with Millions of Machine Translations

Abstract: We describe PARANMT-50M, a dataset of more than 50 million English-English sentential paraphrase pairs. We generated the pairs automatically by using neural machine translation to translate the non-English side of a large parallel corpus, following . Our hope is that PARANMT-50M can be a valuable resource for paraphrase generation and can provide a rich source of semantic knowledge to improve downstream natural language understanding tasks. To show its utility, we use PARANMT-50M to train paraphrastic sentence…

Cited by 266 publications (221 citation statements). References 48 publications.
“…This motivates the case for exploiting incidental annotation (Roth, 2017) and automating some aspects of dataset creation. The current trend of using machine translation systems to produce augmented datasets for machine translation itself (Sennrich et al., 2016) and for monolingual tasks like classification (Yu et al., 2018) and paraphrasing (Wieting and Gimpel, 2018) is a good example of this.…”
Section: Datamentioning
confidence: 99%
“…As our bitext, we use the Czeng1.6 English-Czech parallel corpus (Bojar et al., 2016). We compare it to training on ParaNMT (Wieting and Gimpel, 2018), a corpus of 50 million paraphrases obtained by automatically translating the Czech side of Czeng1.6 into English. We sample 1 million examples each from ParaNMT and Czeng1.6 and evaluate on all 25 datasets from the 2012-2016 STS tasks.…”
Section: Back-translated Text Vs Parallel Textmentioning
confidence: 99%
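The back-translation recipe quoted above (translate the Czech side of a bitext into English, then pair the output with the original English reference) can be sketched as follows. This is a minimal illustration, not ParaNMT's actual pipeline: `translate_cs_to_en` is a hypothetical stub standing in for a real NMT system trained on the parallel corpus.

```python
def translate_cs_to_en(czech_sentence):
    """Stub standing in for a Czech-to-English NMT model (hypothetical)."""
    table = {
        "kočka seděla na rohožce": "the cat was sitting on the mat",
    }
    return table[czech_sentence]

def make_paraphrase_pairs(bitext):
    """Turn a bitext of (english_reference, czech_side) pairs into
    English-English paraphrase pairs via back-translation."""
    pairs = []
    for en_ref, cs in bitext:
        en_bt = translate_cs_to_en(cs)   # back-translated English
        if en_bt != en_ref:              # keep only non-identical pairs
            pairs.append((en_ref, en_bt))
    return pairs

bitext = [("the cat sat on the mat", "kočka seděla na rohožce")]
pairs = make_paraphrase_pairs(bitext)
# pairs -> [("the cat sat on the mat", "the cat was sitting on the mat")]
```

In practice the reference translation and the machine translation of the same source sentence tend to convey the same meaning with different wording, which is what makes the pair usable as a paraphrase.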
“…The model computes the average of the vectors of all words and n-grams in a sentence. Several studies have reported successful results with Sent2vec [32,33]. The model learns a context embedding v w and a target embedding u w for each word w in the vocabulary, where h is the number of embedding dimensions.…”
Section: Sent2vecmentioning
confidence: 99%
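The averaging step described above can be sketched as follows. The embedding table, the choice of h = 4, and the helper names are illustrative assumptions for the sketch, not Sent2vec's actual API; a trained Sent2vec model would supply the learned word and n-gram vectors.

```python
import numpy as np

def ngrams(tokens, n):
    """Return all n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sent2vec_embed(sentence, embeddings, h=4, max_n=2):
    """Average the vectors of all words and n-grams in the sentence.

    `embeddings` maps unigram/n-gram tuples to h-dimensional vectors;
    units missing from the table are skipped. This mirrors only the
    averaging step, not the training of context/target embeddings.
    """
    tokens = sentence.lower().split()
    units = [(t,) for t in tokens]
    for n in range(2, max_n + 1):
        units += ngrams(tokens, n)
    vecs = [embeddings[u] for u in units if u in embeddings]
    if not vecs:
        return np.zeros(h)
    return np.mean(vecs, axis=0)

# Toy embedding table (h = 4); a real model learns these vectors.
rng = np.random.default_rng(0)
vocab = [("the",), ("cat",), ("sat",), ("the", "cat"), ("cat", "sat")]
emb = {u: rng.normal(size=4) for u in vocab}

v = sent2vec_embed("The cat sat", emb)  # 4-dimensional sentence vector
```

Averaging word and n-gram vectors keeps the model fast while the n-grams recover some local word order that a pure bag-of-words average would lose.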