2018
DOI: 10.1002/asi.24074

CLEU ‐ A Cross‐language English‐Urdu corpus and benchmark for text reuse experiments

Abstract: Text reuse is becoming a serious issue in many fields and research shows that it is much harder to detect when it occurs across languages. The recent rise in multi-lingual content on the Web has increased cross-language text reuse to an unprecedented scale. Although researchers have proposed methods to detect it, one major drawback is the unavailability of large-scale gold standard evaluation resources built on real cases. To overcome this problem, we propose a cross-language sentence/passage level text reuse …

citations
Cited by 13 publications
(11 citation statements)
references
References 19 publications
“…According to Asghari et al [51], the main use of the corpus is due to the wide variety of languages it covers and which prevents the many common case errors that translate across different languages using Google Translate. The argument that corpus is the best CLPD evaluation method also found support from some studies [55], [58], [73], [77]. Artifact validation may vary, but they are all in the natural language context, and it is a topic that is still growing.…”
Section: B Discussion (mentioning)
confidence: 84%
“…As a final note, the researchers found that CL-ESA was not very effective; it is the most time-consuming technique, and it is highly reliant upon the corpus employed. Other authors that supported the approach also included [6], [54], [55], [73], among others.…”
Section: B Discussion (mentioning)
confidence: 99%
“…According to surveys conducted by many researchers in China, the amount of raw materials used by students majoring in English is proportional to their language level. However, the results of research on the relationship between language usage and language proficiency of English learners of two languages in China and other countries are different [13,14]. Among researchers of other countries, low-level English learners are more likely to rely on language to express English than high-level English learners.…”
Section: Analysis Based On Frequency (mentioning)
confidence: 99%
“…The aim of the second phase was to extend the base process models collection by manually inducing the vocabulary mismatch problem at different levels of difficulty. This was materialized by borrowing the theoretical bases from the text reuse and plagiarism detection literature [25]-[27]. The literature suggests that an original text can be regenerated in three ways: Near Copy, Light Revision, and Heavy Revision.…”
Section: B: Generate Variants Of the Seed Process Models (mentioning)
confidence: 99%