Models and patterns for real-time data mining play an important role in decision making. Meaningful real-time data mining depends fundamentally on the quality of the data, yet the data available in a warehouse is raw. Such data can be in any format; it may be huge, or it may be unstructured. Data of this kind requires processing to make analysis efficient. The process that makes it ready to use is called data preprocessing. Data preprocessing comprises many activities, such as data transformation, data cleaning, data integration, data optimization and data conversion, which are used to convert raw data into quality data. Data preprocessing is a vital step in data mining: the results of an analysis are only as good as the quality of the data. This paper describes the different data preprocessing techniques that can be used to prepare quality data for analysis from the available raw data.
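To make two of the activities named above concrete, the following is a minimal sketch of data cleaning and data transformation, not the paper's own method. The function names `clean` and `min_max_scale`, the sample records, and the choices of mean imputation and min-max scaling are all assumptions made for illustration.

```python
from statistics import mean


def clean(records):
    """Data cleaning (illustrative): drop exact duplicate rows, then
    fill missing (None) values with the mean of that column."""
    seen, unique = set(), []
    for row in records:
        key = tuple(sorted(row.items()))  # hashable fingerprint of the row
        if key not in seen:
            seen.add(key)
            unique.append(dict(row))  # copy so the input is not mutated

    columns = {name for row in unique for name in row}
    for col in columns:
        present = [row[col] for row in unique if row.get(col) is not None]
        fill = mean(present) if present else 0.0  # assumed fallback value
        for row in unique:
            if row.get(col) is None:
                row[col] = fill
    return unique


def min_max_scale(records, column):
    """Data transformation (illustrative): rescale one numeric column
    to the range [0, 1]."""
    values = [row[column] for row in records]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [
        {**row, column: (row[column] - lo) / span if span else 0.0}
        for row in records
    ]


# Hypothetical raw warehouse records: one duplicate, one missing value.
raw = [
    {"age": 20, "income": 1000},
    {"age": 40, "income": None},
    {"age": 20, "income": 1000},
    {"age": 30, "income": 3000},
]

cleaned = clean(raw)
scaled = min_max_scale(cleaned, "age")
```

Here `cleaned` holds three rows with the missing income imputed, and `scaled` holds the same rows with `age` mapped onto [0, 1]. Other imputation or scaling strategies may suit a given dataset better.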
Text classification is vital and challenging because of the varied kinds of data generated these days. Classifying emotions expressed as text is more challenging still, because emotional content is diverse and is growing on the web. This research work classifies emotions in Hindi poems into four categories, namely Karuna, Shanta, Shringar and Veera. POS tagging is applied to all the poems, and features are then extracted by observing certain poetic features. Two types of features are extracted, and accuracy is measured to test the model. In total, 180 poems were tagged, and features were extracted with 8 different keywords and with 7 different keywords. The model is built with Random Forest and SGDClassifier, and was trained on 134 poems and tested on 46 poems for both types of features. The results with the 7-keyword features are better than those with the 8-keyword features by 7.27% for Random Forest and by 10% for SGDClassifier. Various combinations of hyperparameters are used to tune the model for the best precision and recall. The model is also tested with k-fold cross-validation, with average results of 62.53% for 4 folds and 60.45% for 8 folds with Random Forest, and 54.42% for 4 folds and 48.28% for 8 folds with SGDClassifier. In these experiments, Random Forest performs better than SGDClassifier on the given dataset.
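The k-fold procedure mentioned above can be sketched as follows, to show how 180 poems would be partitioned for 4 and 8 folds. This is a sketch under stated assumptions: the abstract does not say whether the folds were shuffled or stratified, so contiguous, unshuffled folds are assumed here, and the helper name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds.

    Assumption: no shuffling or stratification. When n_samples is not
    divisible by k, the first (n_samples % k) folds get one extra sample.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


N_POEMS = 180  # dataset size stated in the abstract

# Sizes of the held-out fold in each round, for 4 and 8 folds.
sizes_4 = [len(test) for _, test in k_fold_indices(N_POEMS, 4)]
sizes_8 = [len(test) for _, test in k_fold_indices(N_POEMS, 8)]

# Share of the data held out in the fixed 134/46 train/test split.
holdout_fraction = 46 / N_POEMS
```

In each round a model is trained on `train_idx` and scored on `test_idx`, and the per-fold scores are averaged to give a single figure such as those reported above. With more folds, each test fold is smaller, so the per-fold scores tend to vary more.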