In this paper we would like to discuss about stemming effect by using Nazief and Adriani algorithm against similarity detection result of Indonesian written abstract.
Abstract. The accesible loose information through the Internet leads to plagiarism activities use the copy-paste-modify practice is growing rapidly. There have been so many methods, algorithm, and even softwares that developed till this day to avoid and detect the plagiarism which can be used broadly unlimited on a certain subject. Research about detection of plagiarism in Indonesian Language develop day by day, although not significant as English Language. This paper proposes several models of distance-based similarity measure which could be used to assess the similarity in Indonesian text, such as Dice's similarity coefficient, Cosine similarity, and Jaccard coefficient. It implemented together with Rabin-Karp algorithm that common used to detect plagiarism in Indonesian Language. The analysis technique of plagiarism is fingerprint analysis to create fingerprint document according to n-gram value that has been determined, then the similarity value will be counted according to the same number of fingerprint between texts. Small data text about Information System tested in this case and it divided into four kinds of text document with some modified. First document is original text, second is 50% of original text adding with 50% of another text, third 50% original text modified using sinonym and paraphase, fourth some position of text in original text changed. From the experimental result, cosine similarity show better performance in generating value accuracy compared to the dice coefficient and Jaccard coefficient. This model is expected to be used as an alternative type of statistical algorithms that implement the n-grams in the process especially to detect plagiarism in Indonesian text.
Abstrak-Dalam bidang pengolahan bahasa alami dan sistem temu balik informasi, representasi sebuah data teks sangat penting untuk mendukung proses analisis data statistik di dalamnya. Data teks dengan bentuk tidak terstruktur dapat direpresentasikan secara sederhana menggunakan sekumpulan set kata yang disebut bag-of-words dan belum memiliki label atau kelas tertentu. Data unsupervised atau objek-objek yang belum memiliki label dapat dikelompokan menggunakan klustering berdasarkan kemiripan satu objek dengan objek lain. Artikel ini membahas perbandingan hasil pengelompokan unsupervised data menggunakan algoritma kluster yang tersedia pada tools Weka, yaitu SimpleKMeans, X-Means, dan Farthest First. SimpleKMeans dan XMeans digunakan untuk mengolah dataset dan mengelompokan berdasarkan jumlah kluster tetap yang digunakan, sedangkan Farthest First akan meletakan semua pusat kluster pada titik terjauh dari pusat kluster yang sudah ada untuk mengelompokan data. Dataset berasal dari UCI machine learning dengan menggunakan 3 koleksi data, yaitu Enron Email, NIPS Proceedings, dan Daily Kos Blog entries. Performa dataset diuji dengan berbagai masukan parameter yang berbeda meliputi jumlah kluster hingga evaluasi sum squared error (SSE), serta iterasi selama proses pengolahan data. Hasil penelitian diharapkan dapat dijadikan acuan untuk menentukan algoritma dan parameter yang sesuai untuk melakukan pengelompokan data yang tidak memiliki label.Kata Kunci-bag of words, kluster, label, teks, unsupervised
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.