CLEU ‐ A Cross‐language English‐Urdu corpus and benchmark for text reuse experiments

Muneer, Iqra; Sharjeel, Muhammad; Iqbal, Muntaha; Nawab, Rao Muhammad Adeel; Rayson, Paul

doi:10.1002/asi.24074

Cited by 13 publications

(11 citation statements)

References 19 publications


“…According to Asghari et al [51], the main use of the corpus is due to the wide variety of languages it covers and which prevents the many common case errors that translate across different languages using Google Translate. The argument that corpus is the best CLPD evaluation method also found support from some studies [55], [58], [73], [77]. Artifact validation may vary, but they are all in the natural language context, and it is a topic that is still growing.…”

Section: B Discussion (mentioning)

confidence: 84%


Cross-Language Plagiarism Detection: Methods, Tools, and Challenges: A Systematic Review

Botto-Tobar, Serebrenik, Brand

2022

International Journal on Advanced Science, Engineering and Information Technology


Plagiarism is one of the most serious academic offenses. However, people have adopted different approaches to avoid plagiarism, such as transcribing excerpts from one language. Thus, it is challenging to realize this plagiarism form unless someone fully understands another language. Researchers have developed approaches for detecting plagiarism in a variety of different languages. However, most methods created in the past have proved effective for detecting plagiarism in papers published in a single language, most notably English. Therefore, this paper aims to provide a systematic literature review of cross-language plagiarism detection methods (CLPD) in a natural language context. The approach used to perform this study consisted of an extensive search for relevant literature through an SLR and Snowballing. Therefore, we present an overview of (i) cross-language plagiarism detection techniques; (ii) the artifacts and the aspects that were considered in the evaluation phase; and (iii) the lack of guidelines and tools for its implementation. Its contribution lies in its ability to highlight emerging cross-language plagiarism detection techniques trends. Further, we identify any of these techniques in other domains, for instance, software engineering.


“…As a final note, the researchers found that CL-ESA was not very effective; it is the most time-consuming technique, and it is highly reliant upon the corpus employed. Other authors that supported the approach also included [6], [54], [55], [73], among others.…”

Section: B Discussion (mentioning)

confidence: 99%

Cross-Language Plagiarism Detection: Methods, Tools, and Challenges: A Systematic Review

Botto-Tobar, Serebrenik, Brand

2022

International Journal on Advanced Science, Engineering and Information Technology


“…According to surveys conducted by many researchers in China, the amount of raw materials used by students majoring in English is proportional to their language level. However, the results of research on the relationship between language usage and language proficiency of English learners of two languages in China and other countries are different [13,14]. Among researchers of other countries, low-level English learners are more likely to rely on language to express English than high-level English learners.…”

Section: Analysis Based On Frequency (mentioning)

confidence: 99%

Analysis of Matching of Corpus Input and English Proficiency Based on the Big Data Neural Network Model

2022

Advances in Multimedia


In the era of “Internet +” big data, the theory and technology of English corpus are becoming more and more mature. Corpus is an important method to reflect some language characteristics and clarify some language phenomena. In terms of cultural exchanges, Chinese students majoring in English have obvious cultural differences at home and abroad and lack the atmosphere and context for cultural exchanges. In addition, students have problems such as insufficient cultural communication skills. The big data neural network model is adopted in this paper to compare and analyze the intermediary sentences in the corpus to explore the development trend of English proficiency. Through the analysis of typical cases, it explores the weak links in the corpus teaching process and summarizes a method focusing on the combination of use of corpus and English teaching.


“…The aim of the second phase was to extend the base process models collection by manually inducing the vocabulary mismatch problem at different levels of difficulty. This was materialized by borrowing the theoretical bases from the text reuse and plagiarism detection literature [25]-[27]. The literature suggests that an original text can be regenerated in three ways: Near Copy, Light Revision, and Heavy Revision.…”

Section: B: Generate Variants Of the Seed Process Models (mentioning)

confidence: 99%

A Process Model Collection and Gold Standard Correspondences for Process Model Matching

et al. 2019

Self Cite


Business process models are the conceptual models to depict the workflow of an organization. Process model matching (PMM) refers to the automatic identification of corresponding activities between a pair of process models that show similar or the same behavior. During the last few years, PMM has received much of the researchers' attention due to its wide range of applications, such as clone detection and harmonization of process models. Consequently, a plethora of PMM techniques has been developed. In order to evaluate the effectiveness of these techniques, experts have developed three benchmark datasets, formally called PMMC'15 datasets. Furthermore, the process models in the datasets have been converted into OAEI'17 ontologies. These resources are a valuable asset for the PMM community to evaluate process model matching techniques. However, these resources (PMMC'15 and OAEI'17) are limited to fewer models and a handful collection of corresponding activities among these models that may not be sufficient to rigorously evaluate the PMM techniques. To fill this gap, this paper provides a large, diverse, and a carefully handcrafted collection of process models, along with their benchmark correspondences. The process model collection and benchmark correspondences between these models are freely available for the community [1]. Our newly developed dataset, together with the existing resources, can be used for a thorough evaluation of PMM techniques, especially in the context of the vocabulary mismatch problem. At last, we have evaluated the characteristics of our dataset by a series of experiments while involving widely used similarity measures in PMM research. The results reveal that our dataset is larger, diverse, and challenging as compared to existing datasets in the PMM domain.
