Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-Generated Text
DOI: 10.18653/v1/w18-6109

Paraphrase Detection on Noisy Subtitles in Six Languages

Abstract: We perform automatic paraphrase detection on subtitle data from the Opusparcus corpus comprising six European languages: German, English, Finnish, French, Russian, and Swedish. We train two types of supervised sentence embedding models: a word-averaging (WA) model and a gated recurrent averaging network (GRAN) model. We find that GRAN outperforms WA and is more robust to noisy training data. Better results are obtained with more and noisier data than with less and cleaner data. Additionally, we experiment on ot…
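The word-averaging (WA) model mentioned in the abstract composes a sentence embedding as the mean of the embeddings of its words; the GRAN variant replaces this plain mean with a gated, recurrently computed combination. Below is a minimal sketch of the WA idea and a cosine-similarity paraphrase score, assuming PyTorch and padded integer-encoded batches; the class and function names are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAveragingEncoder(nn.Module):
    """Minimal word-averaging (WA) sentence encoder: the sentence
    embedding is the mean of its word embeddings, ignoring padding."""

    def __init__(self, vocab_size: int, dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim, padding_idx=0)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len), where 0 is the padding index.
        emb = self.embed(token_ids)                    # (batch, seq_len, dim)
        mask = (token_ids != 0).unsqueeze(-1).float()  # mask out padding
        summed = (emb * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1.0)        # avoid division by zero
        return summed / counts                         # mean over real tokens

def paraphrase_score(encoder, ids_a, ids_b):
    """Cosine similarity between two sentence embeddings; a pair is
    judged a paraphrase if the score exceeds a tuned threshold."""
    return F.cosine_similarity(encoder(ids_a), encoder(ids_b))
```

In such a setup the encoder is typically trained with a margin-based loss that pulls paraphrase pairs together and pushes randomly paired sentences apart; the GRAN model keeps the same averaging output but gates each word vector with a recurrent hidden state before averaging.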

Cited by 7 publications (11 citation statements). References 12 publications.
“…We sample our training examples similarly to Sjöblom et al. (2018), who picked a certain number of assumed positive examples of paraphrases from the beginning of the training sets and sampled an equal number of assumed negative examples by randomly pairing sentences. Sjöblom et al. (2018) experimented with data sets containing between one and thirty million positive examples. The level of correct labels in these subsets of data is shown in Table 1, which has been compiled from values reported by Creutz (2018) and Sjöblom et al. (2018).…”
Section: Data Selection for BERT Fine-tuning (confidence: 99%)
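The sampling procedure quoted above is simple to express in code. A minimal sketch, assuming the training pairs are available as a list ordered from most to least reliable paraphrase candidates (as in Opusparcus); the function name and details are illustrative, not taken from Sjöblom et al. (2018):

```python
import random

def build_training_pairs(ranked_pairs, n):
    """Sample n assumed-positive and n assumed-negative sentence pairs.

    ranked_pairs: list of (sent1, sent2) tuples, ordered so that the
    most reliable paraphrase candidates come first.
    """
    # Positives: the first n pairs, assumed to be paraphrases.
    positives = [(s1, s2, 1) for s1, s2 in ranked_pairs[:n]]

    # Negatives: random re-pairings of sentences, assumed non-paraphrases.
    pool = [s for pair in ranked_pairs for s in pair]
    negatives = []
    while len(negatives) < n:
        s1, s2 = random.sample(pool, 2)
        negatives.append((s1, s2, 0))

    data = positives + negatives
    random.shuffle(data)
    return data
```

Because the positive labels are only assumed, the effective label accuracy of the sampled set depends on how deep into the ranked list the first n pairs reach, which is what Table 1 in the citing work quantifies.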
“…Sjöblom et al. (2018) experimented with data sets containing between one and thirty million positive examples. The level of correct labels in these subsets of data is shown in Table 1, which has been compiled from values reported by Creutz (2018) and Sjöblom et al. (2018). The reported best models were trained on data sets containing between 60% and 80% of presumably correctly labeled paraphrases.…”
Section: Data Selection for BERT Fine-tuning (confidence: 99%)
“…Michel and Neubig (2018) provided types of noise observed in social media texts and focused on spelling and grammatical errors, emojis, and profanities to perform machine translation. Sjöblom et al. (2018) defined noise types for subtitles, including misspellings, misalignments, and sentence segmentation errors. Agarwal et al. (2020) summarized noise types in free-text answers written by Hindi children, such as punctuation, emojis, translated/transliterated text, missing spaces between words, and so on.…”
Section: Studies on Noisy Texts (confidence: 99%)
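Two of the subtitle noise types listed in the excerpt, misspellings and sentence segmentation errors, can be simulated when probing model robustness. A hypothetical sketch of such noise injection, not the procedure of any cited paper; the probability parameter is arbitrary:

```python
import random

def add_subtitle_noise(sentence: str, p: float = 0.1) -> str:
    """Illustrative noise injection: character-level misspellings and a
    simulated sentence segmentation error (truncation)."""
    chars = list(sentence)
    for i in range(len(chars)):
        if chars[i].isalpha() and random.random() < p:
            # Simulate a misspelling by swapping in a random letter.
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    noisy = "".join(chars)
    if random.random() < p:
        # Simulate a segmentation error by cutting the sentence short.
        words = noisy.split()
        noisy = " ".join(words[: max(1, len(words) // 2)])
    return noisy
```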