Feature importance scores and lossless feature pruning using Banzhaf power indices

Kulynych, Bogdan; Troncoso, Carmela

doi:10.48550/arxiv.1711.04992

Cited by 2 publications

(3 citation statements)

References 2 publications


“…It was reinvented by John F. Banzhaf III in 1964 [Banzhaf III, 1964], and was reinvented once more by James Samuel Coleman in 1971 [Coleman, 1971] before it became part of the mainstream literature. In the field of machine learning, Banzhaf value has been previously applied to the problem of measuring feature importance [Datta et al, 2015, Kulynych and Troncoso, 2017, Sliwinski et al, 2019, Patel et al, 2021, Karczmarz et al, 2021]. While these works suggest that Banzhaf value could be an alternative to the popular Shapley value-based model interpretation methods (e.g., [Lundberg and Lee, 2017]), it remains unclear in which settings the Banzhaf value may be preferable to the Shapley value.…”

Section: A Related Work (mentioning)

confidence: 99%

Data Banzhaf: A Data Valuation Framework with Maximal Robustness to Learning Stochasticity

Wang, Jia

2022

Preprint


This paper studies the robustness of data valuation to noisy model performance scores. Particularly, we find that the inherent randomness of the widely used stochastic gradient descent can cause existing data value notions (e.g., the Shapley value and the Leave-one-out error) to produce inconsistent data value rankings across different runs. To address this challenge, we first pose a formal framework within which one can measure the robustness of a data value notion. We show that the Banzhaf value, a value notion that originated in the cooperative game theory literature, achieves the maximal robustness among all semivalues, a class of value notions that satisfy crucial properties entailed by ML applications. We propose an algorithm to efficiently estimate the Banzhaf value based on the Maximum Sample Reuse (MSR) principle. We derive the lower bound sample complexity for Banzhaf value approximation, and we show that our MSR algorithm's sample complexity nearly matches the lower bound. Our evaluation demonstrates that the Banzhaf value outperforms the existing semivalue-based data value notions on several downstream ML tasks such as learning with weighted samples and noisy label detection. Overall, our study suggests that when the underlying ML algorithm is stochastic, the Banzhaf value is a promising alternative to the other semivalue-based data value schemes given its computational advantage and ability to robustly differentiate data quality.
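To make the Banzhaf value in the abstract above concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes the standard definition, in which the Banzhaf value of player i is the average marginal contribution v(S ∪ {i}) − v(S) over all subsets S of the other players. The sampling estimator follows the Maximum Sample Reuse idea as it is commonly described: draw subsets uniformly at random and, for each player, subtract the mean utility of sampled subsets without that player from the mean utility of sampled subsets containing it. The names `toy_utility`, `exact_banzhaf`, and `msr_banzhaf` and the additive toy game are hypothetical, chosen only for illustration.

```python
import itertools
import random

# Hypothetical additive toy game: the utility of a subset is the sum of
# fixed per-player weights. In a real data valuation setting the utility
# would be a model performance score; that part is assumed away here.
WEIGHTS = [3, 1, 2]


def toy_utility(subset):
    return sum(WEIGHTS[j] for j in subset)


def exact_banzhaf(n, utility):
    """Average marginal contribution of each player over all 2^(n-1) subsets of the others."""
    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(len(others) + 1):
            for combo in itertools.combinations(others, size):
                s = frozenset(combo)
                total += utility(s | {i}) - utility(s)
        values.append(total / 2 ** (n - 1))
    return values


def msr_banzhaf(n, utility, num_samples, seed=0):
    """Sampling estimate that reuses every sampled subset for every player."""
    rng = random.Random(seed)
    sum_with, count_with = [0.0] * n, [0] * n
    sum_without, count_without = [0.0] * n, [0] * n
    for _ in range(num_samples):
        # Including each player independently with probability 1/2 gives a
        # uniform draw over all subsets.
        s = frozenset(j for j in range(n) if rng.random() < 0.5)
        u = utility(s)
        for i in range(n):
            if i in s:
                sum_with[i] += u
                count_with[i] += 1
            else:
                sum_without[i] += u
                count_without[i] += 1
    estimates = []
    for i in range(n):
        mean_with = sum_with[i] / count_with[i] if count_with[i] else 0.0
        mean_without = sum_without[i] / count_without[i] if count_without[i] else 0.0
        estimates.append(mean_with - mean_without)
    return estimates


if __name__ == "__main__":
    print("exact:", exact_banzhaf(3, toy_utility))
    print("sampled:", msr_banzhaf(3, toy_utility, num_samples=20000))
```

For an additive game like this one, each player's exact Banzhaf value equals its weight, which gives a quick sanity check for the sampled estimate.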


“…The LR model's coefficients have been widely utilized for feature importance estimation [67]. Each coefficient represents a score, known as the feature importance score, which describes the significance level between the feature and the target variable.…”

Section: Feature Selection (mentioning)

confidence: 99%

“…The higher the coefficient, the more relevant the feature is to the target variable. In other words, coefficients can be utilized to determine the important and unimportant features to avoid overfitting [67] and are thus useful for prediction [68]. The RFE model ranks the 104 features based on their importance scores obtained from the LR model into a list, in which the first position represents the most significant feature, while the least important feature is ranked on the last position.…”

Section: Feature Selection (mentioning)

confidence: 99%
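The ranking procedure described in the statements above (fit a logistic regression, use coefficient magnitudes as importance scores, and repeatedly drop the weakest feature) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the cited paper's pipeline: the gradient-descent fit, the helper names `fit_logistic`, `rfe_ranking`, and `make_toy_data`, and the synthetic three-feature data set are all hypothetical. It also assumes the features are on comparable scales, since coefficient magnitudes are only meaningful as importance scores in that case.

```python
import math
import random


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent; returns (weights, bias)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            z = b + sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))
            err = p - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def rfe_ranking(X, y):
    """Return feature indices ordered from most to least important."""
    remaining = list(range(len(X[0])))
    eliminated = []
    while len(remaining) > 1:
        # Refit on the surviving features, then drop the one whose
        # coefficient has the smallest absolute value.
        X_sub = [[row[j] for j in remaining] for row in X]
        w, _ = fit_logistic(X_sub, y)
        weakest = min(range(len(remaining)), key=lambda k: abs(w[k]))
        eliminated.append(remaining.pop(weakest))
    # The last survivor ranks first; earlier eliminations rank lower.
    return remaining + eliminated[::-1]


def make_toy_data(num_rows=200, seed=0):
    """Hypothetical data: feature 0 is strong, feature 1 is weaker, feature 2 is noise."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(num_rows)]
    y = [1 if 2.0 * row[0] + 1.0 * row[1] > 0 else 0 for row in X]
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("ranking (most to least important):", rfe_ranking(X, y))
```

Refitting after each elimination is what distinguishes recursive elimination from ranking once on a single fit: a coefficient can change when a correlated feature is removed.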

Risk score prediction model based on single nucleotide polymorphism for predicting malaria: a machine learning approach

2022


Background: The malaria risk prediction is currently limited to using advanced statistical methods, such as time series and cluster analysis on epidemiological data. Nevertheless, machine learning models have been explored to study the complexity of malaria through blood smear images and environmental data. However, to the best of our knowledge, no study analyses the contribution of Single Nucleotide Polymorphisms (SNPs) to malaria using a machine learning model. More specifically, this study aims to quantify an individual's susceptibility to the development of malaria by using risk scores obtained from the cumulative effects of SNPs, known as weighted genetic risk scores (wGRS).

Results: We proposed an SNP-based feature extraction algorithm that incorporates the susceptibility information of an individual to malaria to generate the feature set. However, it can become computationally expensive for a machine learning model to learn from many SNPs. Therefore, we reduced the feature set by employing the Logistic Regression and Recursive Feature Elimination (LR-RFE) method to select SNPs that improve the efficacy of our model. Next, we calculated the wGRS of the selected feature set, which is used as the model's target variables. Moreover, to compare the performance of the wGRS-only model, we calculated and evaluated the combination of wGRS with genotype frequency (wGRS + GF). Finally, Light Gradient Boosting Machine (LightGBM), eXtreme Gradient Boosting (XGBoost), and Ridge regression algorithms are utilized to establish the machine learning models for malaria risk prediction.

Conclusions: Our proposed approach identified SNP rs334 as the most contributing feature with an importance score of 6.224 compared to the baseline, with an importance score of 1.1314. This is an important result as prior studies have proven that rs334 is a major genetic risk factor for malaria. The analysis and comparison of the three machine learning models demonstrated that LightGBM achieves the highest model performance with a Mean Absolute Error (MAE) score of 0.0373. Furthermore, based on wGRS + GF, all models performed significantly better than wGRS alone, in which LightGBM obtained the best performance (0.0033 MAE score).
