Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL) 2019
DOI: 10.18653/v1/k19-1005

Large-Scale, Diverse, Paraphrastic Bitexts via Sampling and Clustering

Abstract: Producing diverse paraphrases of a sentence is a challenging task. Natural paraphrase corpora are scarce and limited, while existing large-scale resources are automatically generated via back-translation and rely on beam search, which tends to lack diversity. We describe PARABANK 2, a new resource that contains multiple diverse sentential paraphrases, produced from a bilingual corpus using negative constraints, inference sampling, and clustering. We show that PARABANK 2 significantly surpasses prior work in bo…
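The abstract describes generating paraphrase candidates via negative constraints and inference sampling, then clustering them to retain diverse outputs. The Python sketch below illustrates only the final selection step under stated assumptions: it is not the authors' pipeline, and the sentence encoder and clustering settings (SentenceTransformer, k-means) are illustrative stand-ins.

# Minimal sketch: cluster sampled paraphrase candidates and keep one
# representative per cluster to encourage diversity. The embedding model
# and number of clusters are illustrative assumptions, not the exact
# setup described in the paper.
from sklearn.cluster import KMeans
from sentence_transformers import SentenceTransformer  # hypothetical choice of encoder

def select_diverse_paraphrases(candidates, n_outputs=5):
    """Embed sampled paraphrases, cluster them, and return the candidate
    closest to each cluster centroid."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(candidates)
    n_clusters = min(n_outputs, len(candidates))
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(embeddings)
    selected = []
    for c in range(n_clusters):
        # squared distance of every candidate to this cluster's centroid
        dists = ((embeddings - km.cluster_centers_[c]) ** 2).sum(axis=1)
        # restrict to members of cluster c and keep the closest one
        members = [i for i, lab in enumerate(km.labels_) if lab == c]
        best = min(members, key=lambda i: dists[i])
        selected.append(candidates[best])
    return selected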

Cited by 38 publications (48 citation statements). References 41 publications.
“…BART is a Transformer (Vaswani et al., 2017) neural network trained on a large unlabeled corpus with a sentence reconstruction loss. We fine-tune it for 4 epochs on sentence pairs from PARABANK 2 (Hu et al., 2019a), which is a paraphrase dataset constructed by back-translating the Czech portion of an English-Czech parallel corpus. We use a subset of 5 million sentence pairs with the highest dual conditional cross-entropy score (Junczys-Dowmunt, 2018), and use only one of the five paraphrases provided for each sentence.…”
Section: AutoQA Implementation
confidence: 99%
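The quoted passage ranks PARABANK 2 pairs by dual conditional cross-entropy score (Junczys-Dowmunt, 2018). Below is a minimal sketch of that scoring idea, assuming per-sentence cross-entropies from a forward and a backward translation model are already available; the exact weighting used by the citing paper may differ.

import math

def dual_xent_score(xent_fwd, len_tgt, xent_bwd, len_src):
    """Hedged sketch of dual conditional cross-entropy filtering:
    combine word-normalized cross-entropies from a forward and a backward
    translation model, favoring pairs that both models find likely and
    agree on. Constants follow the common formulation; they are an
    assumption, not the citing paper's exact setup."""
    h_fwd = xent_fwd / len_tgt   # per-word cross-entropy, forward model
    h_bwd = xent_bwd / len_src   # per-word cross-entropy, backward model
    penalty = abs(h_fwd - h_bwd) + 0.5 * (h_fwd + h_bwd)
    return math.exp(-penalty)    # higher is better, in (0, 1]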
“…Most large-scale paraphrasing datasets are built using bilingual text (Ganitkevitch et al., 2013) and machine translation (Mallinson et al., 2017), or obtained with noisy heuristics (Prakash et al., 2016). Based on human judgment, even some of the better paraphrasing datasets score only 68%-84% on semantic similarity (Hu et al., 2019a; Yang et al., 2019).…”
confidence: 99%
“…We optimize using Adam (Kingma and Ba, 2015). We train on PARABANK2 (Hu et al., 2019c), an English paraphrase dataset. PARABANK2 was generated by training an MT system on CzEng 1.7 (a Czech-English bitext with over 50 million lines; Bojar et al., 2016), re-translating the Czech training sentences, and pairing the English output with the original English translation.…”
Section: Paraphraser
confidence: 99%
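The passage above outlines how PARABANK2 was built: re-translate the Czech side of CzEng and pair the English output with the original English translation. Below is a minimal, hedged sketch of that pairing step; the `translate` callable and the copy-filtering heuristic are illustrative assumptions, not the actual system.

# Hedged sketch of a PARABANK2-style construction: translate the Czech
# side of a Czech-English bitext back into English and pair each output
# with the original English reference. `translate` stands in for whatever
# Czech->English MT system is used (the actual resource was built with a
# system trained on CzEng); the exact-copy filter is an assumption.
def build_paraphrase_pairs(bitext, translate, n_samples=5):
    """bitext: iterable of (czech_sentence, english_reference) pairs.
    translate: callable mapping a Czech sentence to a list of sampled
    English translations. Returns (reference, paraphrase) pairs."""
    pairs = []
    for cs, en_ref in bitext:
        for hyp in translate(cs, num_samples=n_samples):
            if hyp.strip().lower() != en_ref.strip().lower():  # skip exact copies
                pairs.append((en_ref, hyp))
    return pairs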
“…Table 8 shows the Spearman ρ coefficient with STSbenchmark judgments for cosine and approximate LSH Hamming distances of embeddings for BERT, SBERT (and the larger variant SRoBERTa), and pBERT (Hu et al., 2019b), a BERT model fine-tuned to predict paraphrastic similarity, albeit not via angular similarity of embeddings. Table 9 provides details regarding the distributions of sentences into LSH bins of differing levels of granularity using SRoBERTa-L embeddings.…”
Section: E Cosine/LSH Hamming Correlations with STS and Bin Statistics
confidence: 99%
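The quoted comparison correlates cosine similarity and approximate LSH Hamming distance of sentence embeddings with STS judgments. The sketch below shows one standard way to realize such an approximation, using random-hyperplane (sign) hashing so that Hamming distance tracks angular distance; the bit width and correlation setup are assumptions, not the paper's exact configuration.

import numpy as np
from scipy.stats import spearmanr

def lsh_hamming_vs_cosine(emb_a, emb_b, gold_scores, n_bits=256, seed=0):
    """Hash paired sentence embeddings with random hyperplanes (sign of a
    random projection), so Hamming distance between binary codes
    approximates angular distance, then correlate both cosine similarity
    and (negated) Hamming distance with gold STS judgments."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((emb_a.shape[1], n_bits))
    bits_a = (emb_a @ planes) > 0
    bits_b = (emb_b @ planes) > 0
    hamming = (bits_a != bits_b).sum(axis=1)
    cosine = (emb_a * emb_b).sum(axis=1) / (
        np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1))
    rho_cos, _ = spearmanr(cosine, gold_scores)
    rho_lsh, _ = spearmanr(-hamming, gold_scores)  # negate: smaller distance = more similar
    return rho_cos, rho_lsh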