Data clustering with automatic methods such as k-means is widely used in many applications. This paper studies two sub-activities of the clustering process: selecting the number of clusters and analyzing the clustering result. The research studies cluster validation as a way to find an appropriate number of clusters for the k-means method. The experimental data take 3 shapes, with 4 datasets (100 items) per shape, and the spread of the points is generated from a Gaussian (normal) distribution. Two techniques are used for cluster validation: Silhouette and Sum of Squared Errors (SSE). Comparative results are reported for k from 2 to 10. The two measures are consistent in that they indicate an appropriate number of clusters at the same k value (the maximum average Silhouette value and the knee point of the SSE curve).
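The validation procedure described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the synthetic three-blob dataset, the farthest-first seeding of k-means (chosen here only to make the run reproducible), and the second-difference rule used to locate the SSE knee are all assumptions introduced for the example, as are the function names.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest_first(points, k):
    """Deterministic seeding (an assumption of this sketch, for reproducibility)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    return centers

def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda c: dist(p, centers[c])) for p in points]

def kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations; returns (labels, centers)."""
    centers = farthest_first(points, k)
    for _ in range(max_iter):
        labels = assign(points, centers)
        new_centers = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            # Keep the old center if a cluster ends up empty.
            new_centers.append(tuple(sum(v) / len(members) for v in zip(*members))
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return assign(points, centers), centers

def sse(points, labels, centers):
    """Sum of squared distances from each point to its cluster center."""
    return sum(dist(p, centers[l]) ** 2 for p, l in zip(points, labels))

def mean_silhouette(points, labels):
    """Average silhouette value; points in singleton clusters contribute 0."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue
        a = sum(dist(p, q) for q in own) / (len(own) - 1)   # cohesion
        b = min(sum(dist(p, q) for q in other) / len(other)  # separation
                for m, other in clusters.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

# Hypothetical data: 100 points drawn from three well-separated Gaussian blobs.
rng = random.Random(42)
blob_centers = [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
sizes = [34, 33, 33]
points = [(rng.gauss(cx, 0.4), rng.gauss(cy, 0.4))
          for (cx, cy), n in zip(blob_centers, sizes) for _ in range(n)]

sse_by_k, sil_by_k = {}, {}
for k in range(2, 11):
    labels, centers = kmeans(points, k)
    sse_by_k[k] = sse(points, labels, centers)
    sil_by_k[k] = mean_silhouette(points, labels)
    print(f"k={k:2d}  SSE={sse_by_k[k]:10.2f}  silhouette={sil_by_k[k]:.3f}")

# Silhouette criterion: the k with the maximum average silhouette value.
best_k = max(sil_by_k, key=sil_by_k.get)
# Knee heuristic (an assumption): the k with the largest second difference of SSE.
knee_k = max(range(3, 10),
             key=lambda k: sse_by_k[k - 1] - 2 * sse_by_k[k] + sse_by_k[k + 1])
print("best k by silhouette:", best_k, " knee of SSE:", knee_k)
```

On data like this, where the groups are clearly separated, both criteria are expected to point to the same k, which is the kind of agreement the abstract reports. On overlapping data the knee can be much harder to read than the silhouette maximum.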
Image data are normally unstructured and high dimensional, because advances in photography technology allow an image to be taken at a wide range of resolution levels. To overcome this problem, data miners may select only a minimal set of features that are really important for classifying their images. Feature selection is a popular method for reducing the dimensionality of data. However, most feature selection algorithms return their results as a score for each feature. It is still difficult for data miners to choose features from such a scoring scheme, because they may not know which score range is best for the classification task at hand. Therefore, in this research, we aim to assist data miners and novice data analysts with the dimensionality problem by finding the optimal set of features for them, instead of just reporting the scores of all features and leaving the selection step to the miners. We select the optimal set of features by first applying a clustering technique to group similar features based on their scores. We propose the silhouette width criterion for selecting the optimal number of clusters during the cluster analysis step. After that, we perform association mining to analyze relationships that may exist between different subsets of features and the target attribute. Our method finally reports to the user the best subset of features, which can then be used for data classification. We demonstrate the performance of the proposed method on satellite forest image data from Japan.
Water is an important part of our daily lives: food, manufacturing, agriculture, etc. When there is not enough water for the whole population, the result is many undesirable impacts including drought, famine and death. The solution to this problem is good management of water resources, which means planning and designing water-related projects. Runoff prediction is a major part of this planning. It is a complex process that needs an adequate modeling technique for accurate prediction. Therefore, we propose to use combined algorithms to improve prediction performance. Our combination includes two powerful methods: Artificial Neural Network (ANN) and Support Vector Regression (SVR). The root mean square error (RMSE) and the correlation coefficient (R) are the two criteria we use to evaluate model performance by comparing actual runoff with the predictions made by our model. We also compare the performance of our model against other algorithms: Linear Regression, ANN, and Support Vector Machines. The comparison shows that our proposed method performs best, and that the combined model is also quite accurate at predicting peak runoff values during the heavy rain season.
Index Terms-Runoff prediction, artificial neural network, support vector regression, Mun Basin.
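The two evaluation criteria named in this abstract can be computed as in the short sketch below. It assumes the usual definitions, RMSE as the square root of the mean squared difference and R as the Pearson correlation coefficient between actual and predicted values. The function names and the four runoff values are hypothetical and are not taken from the paper.

```python
import math

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def pearson_r(actual, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_a = sum((a - mean_a) ** 2 for a in actual)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_a * var_p)

# Hypothetical runoff values, for illustration only.
actual_runoff = [1.0, 2.0, 3.0, 4.0]
predicted_runoff = [1.0, 2.0, 3.0, 6.0]

error = rmse(actual_runoff, predicted_runoff)
corr = pearson_r(actual_runoff, predicted_runoff)
print(f"RMSE = {error:.4f}, R = {corr:.4f}")
```

A lower RMSE and an R closer to 1 both indicate a better fit, but they measure different things: R stays high when predictions are consistently scaled or shifted, while RMSE penalizes that offset directly. This is one reason for reporting both.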
Abstract-This research studies the role of data mining in semantic web data. The semantic web is popular in a variety of applications, but research on data mining in semantic web data is less common. Open source software for data mining in semantic web data is scarce, and the semantic web data model requires the RDF or OWL format. These formats cannot be used directly in most data mining tools. We thus propose a methodology for mining data that appear in an RDF format. The mining process is demonstrated through the use of R packages.

Index Terms-Data mining, semantic web, R language.

I. INTRODUCTION

Data today are not stored on a single computer. In the current era of information technology and social media, data can be stored on many computers across the internet, which makes them difficult to access quickly and easily. Researchers have presented a technology to help manage these data, called the semantic web. Its data are stored in a common format or specification such as RDF/XML, N3, Turtle, N-Triples and OWL.

The semantic web [1], [2] has been used in various fields such as information systems and search engines. Data mining is a technology for handling such large data, because analyzing large data to find patterns or relationships is its strength. Research on data mining in semantic web data is not yet widespread, since few tools exist for mining semantic web data and the data are stored in a format that cannot be used directly in data mining.

Research on data mining in semantic web data has applied various data mining algorithms, such as data classification and association rule mining.
Most of this research uses licensed software such as Microsoft Data Mining Extension (DMX), which is part of Microsoft SQL Server.

From the above, it can be seen that data today are not always stored on a single computer, and that it is difficult to analyze information on the internet with data mining to find patterns or relationships. This research proposes methods for data mining in semantic web data.

Manuscript received December 13, 2013; revised March 14, 2014. This work was supported in part by a grant from Suranaree University of Technology through the funding of the Data Engineering Research Unit. The authors are with the School of Computer Engineering, Suranaree University of Technology, Nakhon Ratchasima 30000, Thailand (e-mail: chomboon.k@gmail.com).

II. BACKGROUND

A. Semantic Web

The semantic web was developed because stored data could be understood only by humans; machines could not understand the data because they lacked structure. The semantic web has been developed to provide useful data on the Internet that can be analyzed and applied to various tasks. The language used for defining the data structure is RDF. The standardization of the semantic web in the context of Web 3.0 is shown in Fig. 2. The components of the semantic web are as follows: XML stands for Extensible ...
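The paper above observes that RDF data cannot be fed directly to most data mining tools. One way to picture the conversion step is the sketch below, which pivots subject-predicate-object triples into one row per subject. It is only an illustration in Python and does not reproduce the authors' R-based workflow. The parser handles a deliberately simplified subset of N-Triples (IRI subjects and predicates, IRI or plain-literal objects), and the sample data and all names are hypothetical.

```python
import re

# Simplified N-Triples line: <subject> <predicate> <object-IRI> or "plain literal", then " ."
# (an assumption of this sketch; typed literals, language tags and blank nodes are skipped).
TRIPLE = re.compile(r'<([^>]*)>\s+<([^>]*)>\s+(?:<([^>]*)>|"([^"]*)")\s*\.')

def triples_to_rows(ntriples_text):
    """Group triples by subject: {subject: {predicate_local_name: value}}."""
    rows = {}
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = TRIPLE.fullmatch(line)
        if match is None:
            continue  # outside the simplified grammar handled here
        subject, predicate, obj_iri, obj_literal = match.groups()
        value = obj_iri if obj_iri is not None else obj_literal
        column = re.split(r"[/#]", predicate)[-1]  # local name of the predicate
        # A repeated predicate for the same subject overwrites the earlier value.
        rows.setdefault(subject, {})[column] = value
    return rows

# Hypothetical sample data.
sample = """
<http://example.org/p1> <http://example.org/age> "34" .
<http://example.org/p1> <http://example.org/city> "Korat" .
<http://example.org/p2> <http://example.org/age> "51" .
"""

rows = triples_to_rows(sample)
columns = sorted({name for attrs in rows.values() for name in attrs})
table = [[rows[subject].get(name) for name in columns] for subject in sorted(rows)]

print(columns)
for record in table:
    print(record)
```

The resulting table has one column per predicate and `None` where a subject lacks a value, which is the flat attribute-value shape that classification and association rule mining tools generally expect. Real RDF data would call for a full parser rather than a regular expression.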
This paper aims to improve the predictive performance of classification by building multiple classification models from the output of feature selection methods that use an ensemble strategy to find the optimal set of features. Data volume is currently growing at an extreme rate, which causes a variety of problems. Automatic analysis tasks such as data classification suffer from low performance and high computational time when the data are huge in both volume and dimensionality. In this work, we tackle the high-dimensionality aspect of the big data problem. We propose an ensemble method that reduces data dimensionality by clustering features to rank the potential features and return a suitable subset of features for classifying the training data. Two paradigms of feature selection based on the ensemble strategy are proposed and evaluated. Experimental results confirm the efficacy of the proposed feature ensemble method.