Fast and reliable plagiarism detection system

Mozgovoy, Maxim; Karakovskiy, Sergey; Klyuev, Vitaly

doi:10.1109/fie.2007.4417860

Cited by 16 publications

(14 citation statements)

References 10 publications

“…developed with the aim to improve the speed of similarity detection by using an indexed data structure to store files [9,36]. The tokenized versions of source code files are compared using an algorithm similar to the RKR-GST algorithm.…”

Section: Fpds (Fast Plagiarism Detection System) Is a Source Code Sim… (mentioning)

confidence: 99%
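The statement above refers to comparing tokenized files with an algorithm similar to RKR-GST. As a rough illustration only, the following sketch shows plain Greedy String Tiling on two token lists. It omits the Karp-Rabin hashing and the indexed data structure mentioned in the statement, and it is not the cited system's implementation. The function names, the `min_match` parameter, and the similarity formula (share of tokens covered by tiles) are assumptions made for this example.

```python
def greedy_string_tiling(a, b, min_match=3):
    """Return non-overlapping tiles (start_a, start_b, length) shared by token lists a and b."""
    marked_a = [False] * len(a)
    marked_b = [False] * len(b)
    tiles = []
    while True:
        max_len = min_match
        matches = []
        # Find the longest common runs of still-unmarked tokens.
        for i in range(len(a)):
            for j in range(len(b)):
                k = 0
                while (i + k < len(a) and j + k < len(b)
                       and a[i + k] == b[j + k]
                       and not marked_a[i + k] and not marked_b[j + k]):
                    k += 1
                if k > max_len:
                    matches = [(i, j, k)]
                    max_len = k
                elif k == max_len and k >= min_match:
                    matches.append((i, j, k))
        if not matches:
            break
        # Turn the maximal matches into tiles, skipping any that overlap
        # a tile already placed in this round.
        for i, j, k in matches:
            if any(marked_a[i:i + k]) or any(marked_b[j:j + k]):
                continue
            for t in range(k):
                marked_a[i + t] = True
                marked_b[j + t] = True
            tiles.append((i, j, k))
    return tiles


def similarity(a, b, min_match=3):
    """Fraction of tokens in both lists covered by tiles (assumed formula, value in [0, 1])."""
    if not a and not b:
        return 1.0
    covered = sum(length for _, _, length in greedy_string_tiling(a, b, min_match))
    return 2.0 * covered / (len(a) + len(b))
```

With `min_match=3`, reordering two blocks of a token stream still yields full coverage, which is why tiling-based measures tolerate moved code better than a plain longest-common-substring comparison.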

A Source Code Similarity System for Plagiarism Detection

Duric

Gašević

2012

The Computer Journal

Source code plagiarism is easy to commit but very difficult to detect without proper tool support. Various source code similarity detection systems have been developed to help detect source code plagiarism. Those systems need to recognize a number of lexical and structural source code modifications. For example, some structural modifications (e.g., modification of control structures, modification of data structures or structural redesign of source code) can change the source code in such a way that it almost looks genuine. Most of the existing source code similarity detection systems can be confused when these structural modifications have been applied to the original source code. In order to be considered effective, a source code similarity detection system must address these issues. In order to address them, we designed and developed the source code similarity system for plagiarism detection. To demonstrate that the proposed system has the desired effectiveness, we performed a well-known conformism test. The proposed system showed promising results as compared to the JPlag system in detecting source code similarity when various lexical or structural modifications are applied to plagiarized code. As a confirmation of these results, an independent samples t-test revealed that there was a statistically significant difference between average values of F-measures for the test sets that we used and for the experiments that we have done in the practically usable range of cutoff threshold values of 35%-70%.

“…This toolbox may be integrated into open source VLE Moodle. Other scholars deal with plagiarism in the area of using visualization method to find plagiarism in automated student assessments (Graven and MacKinnon 2008), and improving plagiarism detecting systems for the fastest and the most reliable (Mozgovoy et al 2007).…”

Section: Related Work (mentioning)

confidence: 99%

“…Statistical analysis and data mining of outliers is also a possible source to find out useful information for course instructors, for example, recognition of unconcerned students (Mozgovoy et al 2007), etc.…”

Section: Conclusion and Further Investigation (mentioning)

confidence: 99%

Analysis of Students’ Study Activities in Virtual Learning Environments Using Data Mining Methods / Studentų, Besimokančių Virtualaus Mokymo Aplinkoje, Veiklos Analizė Taikant Duomenų Gavybos Metodus

Preidys

Sakalauskas

2010

Technological and Economic Development of Economy

Abstract. This article deals with the application of data mining methods to the analysis of learners' behaviour using the distance learning platform BlackBoard Vista (BlackBoard 2008). Before planning a distance learning course, instructors have to pay attention to the fact that there exist different study methods: some students read learning materials from the very beginning to the end, some students look at unclear topics only, some start with the discussions, etc. Therefore, after analyzing the learning factors and identifying the learner's style, it is possible to prepare individualized learning materials and to choose a proper way of course presentation. Such a way of study organization would improve the quality of studies and make it possible to reach better results. The research was performed by observing the behaviour and results achieved by 528 students in 15 distance learning courses. Using the clustering method, 3 learners' styles in virtual learning environments (VLE) have been identified, and work methods have been proposed for students with regard to those styles. Besides, the research aims to find out the factors that influence students' final evaluations.

“…(I) Euclidean distances on the tf-idf weights like in the previous data set, however, tf and idf now refer to the occurrence of each token instead of term, (II) the Cosine distance on the token frequencies, (III) the normalized compression distance (NCD) on the token streams, (IV) Greedy String Tiling (GST) which is the inherent similarity measure that Plaggie uses to compare the given sources [29,30]; since GST yields a matrix S of pairwise similarities s(x_i, x_j) ∈ S, where values are in (0, 1) and self-similarities equal 1, we converted S into a dissimilarity matrix by taking D := √(1 − S), as proposed in [23]. Fig.…”

Section: Java Programs (mentioning)

confidence: 99%
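Three of the steps named in the statement above can be sketched in a few lines: the cosine distance on token frequencies (II), the normalized compression distance on token streams (III), and the element-wise conversion D := √(1 − S) of a similarity matrix. This is a minimal sketch under stated assumptions, not the citing authors' code: the function names are hypothetical, `zlib` is an assumed choice of compressor, and joining tokens with spaces is an assumed serialization of a token stream.

```python
import math
import zlib
from collections import Counter


def cosine_distance(tokens_a, tokens_b):
    """1 minus the cosine similarity of the two token-frequency vectors."""
    fa, fb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(fa[t] * fb[t] for t in fa)
    norm = (math.sqrt(sum(v * v for v in fa.values()))
            * math.sqrt(sum(v * v for v in fb.values())))
    if norm == 0:
        return 1.0  # assumption: an empty stream is maximally distant
    return 1.0 - dot / norm


def ncd(tokens_a, tokens_b):
    """Normalized compression distance, using zlib as the (assumed) compressor."""
    x = " ".join(tokens_a).encode("utf-8")
    y = " ".join(tokens_b).encode("utf-8")
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + b" " + y))
    return (cxy - min(cx, cy)) / max(cx, cy)


def similarity_to_dissimilarity(S):
    """Apply D := sqrt(1 - S) element-wise to a similarity matrix given as nested lists."""
    return [[math.sqrt(max(0.0, 1.0 - s)) for s in row] for row in S]
```

For example, `similarity_to_dissimilarity([[1.0, 0.75], [0.75, 1.0]])` maps the self-similarities of 1 to a dissimilarity of 0 and the off-diagonal 0.75 to 0.5.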

“…We used the open source plagiarism detection software Plaggie [29] to extract a tokenized representation (a token stream) from each given Java source code. Based on the token streams, we consider four different dissimilarity measures:…”

Section: Java Programs (mentioning)

confidence: 99%

How to Quantitatively Compare Data Dissimilarities for Unsupervised Machine Learning?

Mokbel

Groß

Lux

et al.

2012

Artificial Neural Networks in Pattern Recognition

For complex data sets, the pairwise similarity or dissimilarity of data often serves as the interface of the application scenario to the machine learning tool. Hence, the final result of training is severely influenced by the choice of the dissimilarity measure. While dissimilarity measures for supervised settings can eventually be compared by the classification error, the situation is less clear in unsupervised domains where a clear objective is lacking. The question arises of how to compare dissimilarity measures and their influence on the final result in such cases. In this contribution, we propose to use a recent quantitative measure, introduced in the context of unsupervised dimensionality reduction, to compare whether and on which scale dissimilarities coincide for an unsupervised learning task. Essentially, the measure evaluates how far neighborhood relations are preserved if evaluated based on rankings, which makes the measure robust against scaling of data. Apart from a global comparison, local versions make it possible to highlight regions of the data where two dissimilarity measures induce the same results.
