2015
DOI: 10.1371/journal.pone.0117844
Large Unbalanced Credit Scoring Using Lasso-Logistic Regression Ensemble

Abstract: Recently, various ensemble learning methods with different base classifiers have been proposed for credit scoring problems. However, for various reasons, there has been little research using logistic regression as the base classifier. In this paper, given large unbalanced data, we consider the plausibility of ensemble learning using regularized logistic regression as the base classifier to deal with credit scoring problems. In this research, the data is first balanced and diversified by clustering and bagging …
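The pipeline outlined in the abstract — rebalance an unbalanced dataset into several subsets, fit an L1-regularized (lasso) logistic regression on each, and aggregate the base models — can be sketched as follows. This is a minimal illustration, not the paper's exact method: the synthetic data, the undersampling-based bagging, the number of base models, and the probability-averaging rule are all assumptions for the sketch.

```python
# Minimal sketch of a lasso-logistic regression ensemble on unbalanced data.
# Data and hyperparameters are illustrative, not from the paper.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic unbalanced data: 1000 majority (label 0) vs. 50 minority (label 1).
X_maj = rng.normal(0.0, 1.0, size=(1000, 5))
X_min = rng.normal(1.5, 1.0, size=(50, 5))
X = np.vstack([X_maj, X_min])
y = np.concatenate([np.zeros(1000), np.ones(50)])

maj_idx = np.where(y == 0)[0]
min_idx = np.where(y == 1)[0]

models = []
for _ in range(10):  # 10 base classifiers (illustrative count)
    # Balance each training subset: all minority rows plus an equal-sized
    # random draw (with replacement) from the majority class.
    sub = np.concatenate([min_idx,
                          rng.choice(maj_idx, size=len(min_idx), replace=True)])
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
    clf.fit(X[sub], y[sub])
    models.append(clf)

# Aggregate by averaging predicted probabilities across base models.
proba = np.mean([m.predict_proba(X)[:, 1] for m in models], axis=0)
pred = (proba >= 0.5).astype(int)
print("minority recall:", pred[y == 1].mean())
```

Training each base learner on a rebalanced subset keeps the logistic loss from being dominated by the majority class, while averaging over subsets recovers some of the discarded majority information.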

Cited by 89 publications (60 citation statements) · References 50 publications
“…These diverse classifiers can be combined using a stacked ensemble approach, which involves using a training data set to empirically fit a function that combines the classifiers to produce a single classification result [12]. Depending on the outcome to be classified, stacked ensemble approaches have used a variety of linear, polynomial and logistic regression models to integrate the individual classifiers into a single prediction [11,13]. This approach provides an opportunity to combine the information from multiple job components, which may vary in their relative strength of association with the occupation classification, to identify the best occupation classification.…”
Section: Introduction
confidence: 99%
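The stacking idea quoted above — base classifiers combined by a regression model fit on their outputs — can be sketched with scikit-learn's `StackingClassifier`. The dataset, base learners, and logistic meta-learner here are illustrative assumptions, not the setup of the cited studies.

```python
# Sketch of a stacked ensemble: a logistic regression meta-learner is fit
# on out-of-fold predictions of the base classifiers. All model and data
# choices are illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
                ("nb", GaussianNB())],
    final_estimator=LogisticRegression(),  # meta-learner combines base outputs
    cv=5,  # out-of-fold base predictions avoid leaking training labels
)
stack.fit(X, y)
print("train accuracy:", stack.score(X, y))
```

The `cv` argument matters: fitting the meta-learner on in-sample base predictions would overweight whichever base model overfits most.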
“…This type of three-layer MLP is a commonly adopted ANN structure for binary classification problems such as bankruptcy prediction [11]. Although the characteristics of ANN ensembles, such as efficiency, robustness, and adaptability, make them a valuable tool for classification, decision support, financial analysis, and credit scoring, it should be noted that some researchers have shown that the ensembles of multiple neural network classifiers are not always superior to a single best neural network classifier [17]. Hence, we focus on applying a single neural network model to bankruptcy prediction.…”
Section: Introduction
confidence: 99%
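The three-layer MLP (input, one hidden layer, output) referenced above can be sketched as a single scikit-learn classifier; the layer width, dataset, and iteration budget are illustrative assumptions, not values from the cited bankruptcy studies.

```python
# Sketch of a single three-layer MLP for binary classification.
# Architecture and data are illustrative.
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=400, n_features=8, random_state=1)

mlp = MLPClassifier(hidden_layer_sizes=(8,),  # one hidden layer -> three layers total
                    max_iter=2000, random_state=1)
mlp.fit(X, y)
print("train accuracy:", mlp.score(X, y))
```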
“…[3,19–21] Least absolute shrinkage and selection operator (LASSO) [22] is very popular for variable selection and is widely used for analyzing complex data, but only a few works directly deal with class-imbalanced data [23,24]. Another preprocessing strategy for addressing class-imbalanced data is to rebalance the class distribution by sampling [25]. This rebalancing can be done by either undersampling or oversampling to equilibrate the sample sizes of the two classes in the training data, as in the synthetic minority oversampling technique (SMOTE).…”
Section: Introduction
confidence: 99%
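The oversampling strategy mentioned above — SMOTE synthesizes minority samples by interpolating between a minority point and one of its minority-class nearest neighbors — can be sketched in plain NumPy. This is a simplified illustration under assumed data; the `imbalanced-learn` library provides a full implementation.

```python
# Minimal SMOTE-like oversampling sketch: new minority samples are drawn
# on line segments between a minority point and one of its k nearest
# minority neighbors. Simplified for illustration only.
import numpy as np

rng = np.random.default_rng(0)
X_min = rng.normal(1.0, 0.3, size=(20, 3))  # minority class (illustrative)

def smote_like(X, n_new, k=5, rng=rng):
    # Pairwise distances within the minority class.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)              # a point is not its own neighbor
    neighbors = np.argsort(d, axis=1)[:, :k] # k nearest minority neighbors
    synth = []
    for _ in range(n_new):
        i = rng.integers(len(X))             # random minority seed point
        j = rng.choice(neighbors[i])         # one of its minority neighbors
        lam = rng.random()                   # interpolation weight in [0, 1)
        synth.append(X[i] + lam * (X[j] - X[i]))
    return np.array(synth)

X_new = smote_like(X_min, n_new=30)
print(X_new.shape)  # (30, 3)
```

Because each synthetic point is a convex combination of two minority points, oversampling stays inside the minority region rather than duplicating rows exactly, which is what distinguishes SMOTE from plain random oversampling.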