Processing large amounts of data with many input features is time-consuming and expensive. In machine learning (ML), the number of input features plays a crucial role in determining the performance of ML models, and prior studies show that ML techniques themselves can be used for dimensionality reduction. This work proposes a methodology that uses ML to reduce the number of input features and thereby enable cost-effective data analysis. Two water quality prediction datasets from Kaggle are used to run the ML models.
First, we use Recursive Feature Elimination with Cross-Validation (RFECV), Permutation Importance (PI), and Random Forest (RF) models to measure the impact of each input feature on predicting water quality.
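As a minimal sketch of this step, the code below runs RFECV and permutation importance with a Random Forest on a Kaggle-style water-potability table; the file name and the "Potability" target column are assumptions for illustration, not details taken from the paper.

```python
# Minimal feature-selection sketch, assuming a Kaggle-style water-potability
# CSV with a binary "Potability" target (hypothetical file and column names).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

df = pd.read_csv("water_potability.csv").dropna()  # hypothetical path
X, y = df.drop(columns=["Potability"]), df["Potability"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

rf = RandomForestClassifier(n_estimators=100, random_state=42)

# RFECV: recursively drop the weakest feature and keep the subset that
# maximizes cross-validated accuracy.
selector = RFECV(rf, step=1, cv=5, scoring="accuracy").fit(X_train, y_train)
print("Selected features:", list(X.columns[selector.support_]))

# Permutation importance: accuracy drop when each feature is shuffled.
rf.fit(X_train, y_train)
pi = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=42)
for name, score in sorted(zip(X.columns, pi.importances_mean),
                          key=lambda t: -t[1]):
    print(f"{name}: {score:.4f}")
```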
Second, we conduct experiments applying seven ML models: RF, Decision Tree (DT), Logistic Regression (LR), K-Nearest Neighbors (KNN), Gaussian Naïve Bayes (GNB), Support Vector Machine (SVM), and Deep Neural Network (DNN) to predict water quality on both the original and reduced datasets.
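The comparison across the seven models could look like the following sketch, which times each fit on the original and RFECV-reduced feature sets (reusing names from the previous sketch); scikit-learn's MLPClassifier stands in for the paper's DNN, whose architecture is not specified here.

```python
# Sketch of the model-comparison step: train the seven classifiers on the
# full and reduced feature sets and time each fit, assuming X_train/X_test
# and `selector` from the previous sketch. MLPClassifier is a stand-in DNN.
import time
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

models = {
    "RF": RandomForestClassifier(random_state=42),
    "DT": DecisionTreeClassifier(random_state=42),
    "LR": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
    "GNB": GaussianNB(),
    "SVM": SVC(),
    "DNN": MLPClassifier(max_iter=500, random_state=42),
}
reduced = list(X.columns[selector.support_])  # RFECV-selected features

for name, model in models.items():
    for label, cols in [("original", list(X.columns)), ("reduced", reduced)]:
        t0 = time.perf_counter()
        model.fit(X_train[cols], y_train)
        elapsed = time.perf_counter() - t0
        acc = model.score(X_test[cols], y_test)
        print(f"{name} ({label}): acc={acc:.3f}, fit={elapsed:.3f}s")
```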
Third, we evaluate the impact of the optimized feature set on the computation and cost of testing water quality. Experimental results show that reducing the number of features from nine to five for Dataset 1 reduces computations by up to 59% and cost by up to 65%.
Similarly, reducing the number of features from 20 to 16 for Dataset 2 reduces computations by up to 20% and cost by up to 14%. This study may help mitigate the curse of dimensionality by improving the performance of ML models through better data generalization.