2023
DOI: 10.22541/au.167528155.52982644/v1
Preprint

Are Neural Language Models Good Plagiarists? A Benchmark for Neural Paraphrase Detection

Abstract: Neural language models such as BERT allow for human-like text paraphrasing. This ability threatens academic integrity, as it complicates identifying machine-obfuscated plagiarism. We make two contributions to foster research on detecting these novel machine-paraphrases. First, we provide the first large-scale dataset of documents paraphrased using the Transformer-based models BERT, RoBERTa, and Longformer. The dataset includes paragraphs from scientific papers on arXiv, theses, and Wikipedia articles and th…

Cited by 4 publications (7 citation statements) · References 43 publications
“…Most research on paraphrase identification quantifies to which degree the meaning of two sentences is identical. Approaches for this task employ lexical, syntactic, and semantic analysis (e.g., word embedding) as well as machine learning and deep learning techniques [12,50].…”
Section: Related Work
confidence: 99%
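The citation statement above notes that paraphrase identification typically scores how closely two sentences' meanings match, using lexical, syntactic, and semantic analysis. As a minimal illustration of the lexical end of that spectrum (not the paper's method, and no embedding model assumed), one can score sentence pairs by the cosine similarity of their bag-of-words count vectors:

```python
from collections import Counter
import math


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity of bag-of-words count vectors (pure lexical overlap)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)           # shared-token contribution
    na = math.sqrt(sum(c * c for c in va.values()))  # vector norms
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0
```

A score of 1.0 means identical token distributions; well-obfuscated machine paraphrases are hard precisely because they keep the meaning while driving this lexical score down, which is why the literature moves on to embeddings and deep models.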
“…Wahle et al. [50] is the only work to date that applies neural language models to generate machine-paraphrased text. They use BERT and other popular neural language models to paraphrase an extensive collection of original content.…”
Section: Related Work
confidence: 99%
“…The same way word2vec [18] inspired many models in NLP [4,26,25], the excellent performance of BERT [8], a Transformer-based model [28], led to its numerous adaptations for language tasks [34,6,29,30]. Domain-specific models built on top of Transformers typically outperform their baselines for related tasks [12].…”
Section: Related Work
confidence: 99%
“…Characteristics: stylometrics used in author attribution can be separated into different levels of analysis (lexical, syntactic, semantic, structural, and application-specific). These characteristics can be: the number of words and sentences; the length of words and sentences; the frequency of tool words, form words, word n-grams, and character n-grams [2,3]; the frequency of parts of speech; collocations (the frequency of part-of-speech bigrams); the number of hapax legomena (words appearing only once); etc. Jan et al. presented a benchmark for neural paraphrase detection [4]; in this research they show that the paraphrased text maintains the semantics of the original source. In [5] the author contributed to detecting machine-paraphrased plagiarism.…”
Section: Related Work
confidence: 99%
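The stylometric characteristics enumerated in the statement above (word and sentence counts, word length, hapax legomena, character n-grams) are all cheap surface statistics. A minimal sketch of extracting a few of them, using naive whitespace tokenization and period-based sentence splitting (assumptions of this illustration, not of the cited works):

```python
from collections import Counter


def stylometric_features(text: str) -> dict:
    """Extract a handful of the surface-level stylometric features
    used in author attribution: counts, average word length,
    hapax legomena, and character trigram frequencies."""
    words = text.lower().split()  # naive whitespace tokenization
    # crude sentence split: treat !, ? as sentence-ending periods
    sentences = [s for s in text.replace("!", ".").replace("?", ".").split(".")
                 if s.strip()]
    counts = Counter(words)
    hapax = [w for w, c in counts.items() if c == 1]  # words occurring once
    char_trigrams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return {
        "n_words": len(words),
        "n_sentences": len(sentences),
        "avg_word_len": sum(map(len, words)) / len(words) if words else 0.0,
        "n_hapax": len(hapax),
        "trigram_counts": char_trigrams,
    }
```

Feature vectors like this are what the classical author-attribution classifiers mentioned above consume; a real system would use proper tokenization and part-of-speech tagging rather than these simplifications.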