The Web holds an enormous and varied body of data for data mining research. Clustering web usage data helps discover interesting patterns in user traversals, behaviour, and usage characteristics. Moreover, users access web pages in an order that reflects their interests, so incorporating the sequential nature of their usage is crucial for clustering web transactions. In this paper we apply the OPTICS ("Ordering Points To Identify the Clustering Structure") algorithm to find density-based clusters in web usage data from the MSNBC.COM website, a free news website with many different categories of news. The clusters are generated by the OPTICS algorithm, and the inter-cluster and intra-cluster averages are calculated. The results are compared across different similarity measures: Euclidean, Jaccard, projected Euclidean, cosine, and fuzzy similarity. Finally, we show the behaviour of the clusters produced by the OPTICS algorithm on sequential data in the web usage domain. We performed a variety of experiments in the context of density-based clustering, explain our results, and list our conclusions.

Keywords
Clustering algorithm, OPTICS, Ordering Points To Identify the Clustering Structure, Sequence mining, Average inter-cluster, Intra-cluster.

INTRODUCTION
The web is a huge database for research on relationships among objects, people, societies, companies, marketing, management, knowledge, and so on. Clustering is one way to collect subsets of data that share common attributes and to find hidden patterns that create knowledge from databases. Types of data clustering include hierarchical, partitional, density-based, and subspace clustering. In this paper we use a density-based clustering algorithm, OPTICS, to cluster a dataset from the msnbc.com website. We chose it because it is a density-based algorithm and our data is density-based in nature.
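As a rough illustration of the evaluation described in the abstract, the sketch below computes average intra-cluster and inter-cluster distances for a clustering that has already been produced. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the toy feature vectors, and the choice to average over all point pairs are all hypothetical.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; defined as 1.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm else 1.0


def average_intra(clusters, dist):
    """Mean distance over all pairs of points that share a cluster."""
    values = [dist(p, q)
              for points in clusters.values()
              for p, q in combinations(points, 2)]
    return sum(values) / len(values) if values else 0.0


def average_inter(clusters, dist):
    """Mean distance over all pairs of points drawn from different clusters."""
    values = [dist(p, q)
              for (_, a), (_, b) in combinations(clusters.items(), 2)
              for p in a
              for q in b]
    return sum(values) / len(values) if values else 0.0


# Hypothetical toy data: each tuple counts visits to three page categories.
clusters = {
    0: [(3, 0, 1), (4, 0, 1), (3, 1, 1)],
    1: [(0, 5, 0), (0, 4, 1)],
}

intra = average_intra(clusters, euclidean)
inter = average_inter(clusters, euclidean)
print(f"average intra-cluster distance: {intra:.3f}")
print(f"average inter-cluster distance: {inter:.3f}")
```

A compact, well-separated clustering should give a small intra-cluster average relative to the inter-cluster average. Passing `cosine_distance` instead of `euclidean` repeats the comparison under a different measure.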
First, we downloaded the data (around 40,000 records) from the msnbc.com website and created a dataset file. We preprocessed the dataset to remove impossible data combinations; for example, some attributes may never have been used by the users. After validating the data, we applied the OPTICS algorithm to the new dataset to cluster it. Finally, we used the Euclidean distance, projected Euclidean distance, cosine similarity, and fuzzy dissimilarity to compare the intra-cluster and inter-cluster results, and analyzed and visualized the data and graphs.

RELATED WORK

OPTICS
The OPTICS (Ordering Points To Identify the Clustering Structure) algorithm was designed by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to that of DBSCAN (Density-Based Spatial Clustering of Applications with Noise): to identify the clusters and noise in a given database. OPTICS defines clusters on the basis of density. In this algorithm, a point "p" belongs to a cluster if its neighbourhood contains at least a minimum number of points (called "MinPts") that are not farther than the defin...
Web usage mining is the application of data mining techniques to web log data repositories. It is used to find user access patterns in web access logs. User page visits are sequential in nature. In this paper we present a method for clustering web transactions based on set similarity measures computed from web log data, which capture users' page-visit behaviour and the order in which visits occur. Web data clusters are formed using Similarity Upper Approximations. We present experimental results on the MSNBC web navigation dataset, which is sequential in nature. Clustering in web usage mining finds groups that share common interests and behaviour by analyzing the data collected on web servers. This study contributes to the topic of clustering web usage data and shows the interests and behaviours of various user visits.
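To make the idea of a set similarity measure between web transactions concrete, here is a minimal sketch using the Jaccard coefficient. The excerpt does not say which set measure the study uses, so Jaccard is an assumption, and the session data and function name are hypothetical.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient |A & B| / |A | B| of the pages in two sessions."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # convention chosen here: two empty sessions are identical
    return len(set_a & set_b) / len(set_a | set_b)


# Hypothetical sessions: each list holds the page categories a user visited.
sessions = {
    "u1": ["frontpage", "news", "tech"],
    "u2": ["frontpage", "news", "sports"],
    "u3": ["weather", "travel"],
}

sim_12 = jaccard_similarity(sessions["u1"], sessions["u2"])
sim_13 = jaccard_similarity(sessions["u1"], sessions["u3"])
print(sim_12, sim_13)
```

Plain Jaccard treats a session as an unordered set, so it ignores both repeated visits and the order of visits; a sequence-aware measure would be needed to capture order of occurrence.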