Feature selection is a pre-processing technique used to remove unnecessary characteristics, and speed up the algorithm's work process. A part of the technique is carried out by calculating the information gain value of each dataset characteristic. Also, the determined threshold rate from the information gain value is used in feature selection. However, the threshold value is used freely or through a rate of 0.05. Therefore this study proposed the threshold rate determination using the information gain value’s standard deviation generated by each feature in the dataset. The threshold value determination was tested on 10 original datasets transformed by FFT and IFFT and classified using Random Forest. On processing the transformed dataset with the proposed threshold this study resulted in lower accuracy and longer execution time compared to the same process with Correlation-Base Feature Selection (CBF) and a standard 0.05 threshold method. Similarly, the required accuracy value is lower when using transformed features. The study showed that by processing the original dataset with a standard deviation threshold resulted in better feature selection accuracy of Random Forest classification. Furthermore, by using the transformed feature with the proposed threshold excluding the imaginary numbers leads to a faster average time than the three methods compared.
Abstrak-String searching merupakan suatu proses yang umum dilakukan dalam proses-proses yang dilakukan komputer karena teks merupakan bentuk utama penyimpanan data. Terdapat beberapa macam cara yang dapat dilakukan untuk mencari sebuah string pada kumpulan string lain yang lebih besar. Beberapa diantaranya adalah algoritma Boyer-Moore, Turbo Boyer-Moore dan Tuned Boyer-Moore. Guna mengetahui bagaimana performa algoritma-algoritma tersebut, terutama di bidang waktu yang diperlukan, maka dibuatlah aplikasi yang dapat digunakan untuk mengetahui waktu yang diperlukan untuk mencari suatu pattern dalam text. Aplikasi dibangun menggunakan metode prototyping dan menggunakan Microsoft Visual Studio dengan bahasa C# untuk pembangunannya. Aplikasi ini mendukung pencarian dengan penggunaan tiga algoritma (Boyer-Moore, Turbo Boyer-Moore, Tuned Boyer-Moore), pengubah kata (replace), highlight kata yang dicari, dan pemberian informasi waktu yang dibutuhkan masing-masing algoritma untuk pencarian serta algoritma mana yang membutuhkan waktu paling sedikit untuk pencarian. Dari penelitian yang dilakukan, dapat disimpulkan bahwa algoritma Boyer-Moore adalah algoritma yang paling cepat dalam pencarian string.Kata kunci-Boyer-Moore, desktop application, kecepatan algoritma, string searching, Tuned Boyer-Moore, Turbo Boyer-Moore
Random Forest is a supervised classification method based on bagging (Bootstrap aggregating) Breiman and random selection of features. The choice of features randomly assigned to the Random Forest makes it possible that the selected feature is not necessarily informative. So it is necessary to select features in the Random Forest. The purpose of choosing this feature is to select an optimal subset of features that contain valuable information in the hope of accelerating the performance of the Random Forest method. Mainly for the execution of high-dimensional datasets such as the Parkinson, CNAE-9, and Urban Land Cover dataset. The feature selection is done using the Correlation-Based Feature Selection method, using the BestFirst method. Tests were carried out 30 times using the K-Cross Fold Validation value of 10 and dividing the dataset into 70% training and 30% testing. The experiments using the Parkinson dataset obtained a time difference of 0.27 and 0.28 seconds faster than using the Random Forest method without feature selection. Likewise, the trials in the Urban Land Cover dataset had 0.04 and 0.03 seconds, while for the CNAE-9 dataset, the difference time was 2.23 and 2.81 faster than using the Random Forest method without feature selection. These experiments showed that the Random Forest processes are faster when using the first feature selection. Likewise, the accuracy value increased in the two previous experiments, while only the CNAE-9 dataset experiment gets a lower accuracy. This research’s benefits is by first performing feature selection steps using the Correlation-Base Feature Selection method can increase the speed of performance and accuracy of the Random Forest method on high-dimensional data.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.