In the era of "big data", it is becoming more of a challenge not only to build state-of-the-art predictive models, but also to gain an understanding of what is really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms, like random forests (RFs) and gradient boosted decision trees (GBMs), have a natural way of quantifying the importance or relative influence of each feature. Other algorithms, like naive Bayes classifiers and support vector machines, are not capable of doing so, and model-agnostic approaches are generally used to measure each predictor's importance. Enter vip, an R package for constructing variable importance scores/plots for many types of supervised learning algorithms using model-specific and novel model-agnostic approaches. We also discuss a novel way to display both feature importance and feature effects together using sparklines: very small line charts that convey the general shape or variation in some feature and can be embedded directly in text or tables.

[1] Although "interpretability" is difficult to formally define in the context of ML, we follow Doshi-Velez and Kim (2017) and describe "interpretable" as the "...ability to explain or to present in understandable terms to a human." [2] In this context, "importance" can be defined in a number of different ways. In general, we can describe it as the extent to which a feature has a "meaningful" impact on the predicted outcome. A more formal definition and treatment can be found in van der Laan (2006). [3] We refer the reader to Poulin et al. (2006), Caruana et al. (2015), Bibal and Frénay (2016), and Bau et al. (2017) for discussions around model structure interpretation.
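One widely used model-agnostic approach of the kind mentioned above is permutation importance: shuffle one feature column, re-score the model, and record how much the error worsens. The sketch below illustrates the general idea in Python under simple assumptions (a toy model that is a plain function, mean squared error as the metric); it is not vip's actual implementation, and `permutation_importance` is a hypothetical helper name.

```python
import random

def permutation_importance(predict, X, y, col, n_repeats=20, seed=0):
    """Model-agnostic importance sketch: average increase in MSE when
    feature `col` is shuffled. `predict` can be any fitted model's
    prediction function -- nothing model-specific is assumed."""
    rng = random.Random(seed)
    mse = lambda yhat: sum((p - t) ** 2 for p, t in zip(yhat, y)) / len(y)
    base = mse([predict(row) for row in X])  # error with intact data
    deltas = []
    for _ in range(n_repeats):
        perm = [row[col] for row in X]
        rng.shuffle(perm)  # break the link between feature `col` and y
        Xp = [row[:col] + [v] + row[col + 1:] for row, v in zip(X, perm)]
        deltas.append(mse([predict(row) for row in Xp]) - base)
    return sum(deltas) / n_repeats

# Toy model: predictions depend only on feature 0, so shuffling feature 1
# should leave the error unchanged while shuffling feature 0 inflates it.
X = [[float(i), float(i % 3)] for i in range(30)]
y = [2 * row[0] for row in X]
model = lambda row: 2 * row[0]
imp0 = permutation_importance(model, X, y, col=0)
imp1 = permutation_importance(model, X, y, col=1)
# imp0 > 0 and imp1 == 0.0
```

Because the metric and the prediction function are the only inputs, the same recipe applies to naive Bayes, support vector machines, or any other model without a built-in importance measure.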
Residual diagnostics is an important topic in the classroom, but it is used less often in practice when the response is binary or ordinal. Part of the reason for this is that generalized models for discrete data, like cumulative link models and logistic regression, do not produce standard residuals that can be interpreted as easily as those in ordinary linear regression. In this paper, we introduce the R package sure, which implements a recently developed idea of SUrrogate REsiduals. We demonstrate the utility of the package in detecting cumulative link model misspecification with respect to mean structures, link functions, heteroscedasticity, proportionality, and interaction effects.

Bootstrapping

Since the surrogate residuals are based on random sampling, additional variability is introduced. One way to account for this sampling variability and help stabilize any patterns in diagnostic plots is to use the bootstrap (Efron, 1979). The procedure for bootstrapping surrogate residuals is similar to the model-based bootstrap algorithm used in linear regression. To obtain the b-th bootstrap replicate of the residuals, Liu and Zhang (2017) suggest the following algorithm:

Step 1. Perform a standard case-wise bootstrap of the original data to obtain the bootstrap sample (X_1b, Y_1b), ..., (X_nb, Y_nb).

Step 2. Using the procedure outlined in the previous section, obtain a sample of surrogate residuals R^S_1b, ..., R^S_nb from the bootstrap sample obtained in Step 1.

This procedure is repeated a total of B times. For residual-vs-covariate (i.e., R-vs-x) plots and residual-vs-fitted-value (i.e., R-vs-f(X, beta-hat)) plots, we simply scatter all B x n residuals on the same plot. This approach is valid since the bootstrap samples are drawn independently. For large data sets, we find it useful to lower the opacity of the data points to help alleviate any issues with overplotting.
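The two-step algorithm above is generic: resample cases, refit, collect residuals, pool. The Python sketch below shows that skeleton with toy stand-ins (a mean-only "model" and ordinary deviations as its "residuals") in place of fitting a cumulative link model and drawing surrogate residuals, which in the package itself is done in R; `casewise_bootstrap_residuals` and both stand-in functions are hypothetical names for illustration.

```python
import random

def casewise_bootstrap_residuals(x, y, fit, resid, B=200, seed=0):
    """Case-wise bootstrap of residuals.
    Step 1: resample (x_i, y_i) pairs with replacement and refit.
    Step 2: compute residuals from the refit model on the bootstrap sample.
    Repeat B times and pool all B*n residuals, as done for the
    R-vs-x and R-vs-fitted scatter plots described above."""
    rng = random.Random(seed)
    n = len(x)
    pooled = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]       # Step 1: case-wise resample
        xb = [x[i] for i in idx]
        yb = [y[i] for i in idx]
        model = fit(xb, yb)                              # refit on bootstrap sample
        pooled.extend(resid(model, xb, yb))              # Step 2: residuals from refit
    return pooled                                        # B*n residuals, one pooled list

# Toy stand-ins: a mean-only fit whose residuals are deviations from the mean.
fit_mean = lambda xb, yb: sum(yb) / len(yb)
resid_mean = lambda m, xb, yb: [yi - m for yi in yb]

x = list(range(10))
y = [0.5 * xi + 1.0 for xi in x]
res = casewise_bootstrap_residuals(x, y, fit_mean, resid_mean, B=50)
print(len(res))  # 50 replicates x 10 observations = 500 pooled residuals
```

Pooling all B x n residuals on one plot, as the excerpt notes, is valid because each bootstrap replicate is drawn independently of the others.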
For Q-Q plots, on the other hand, Liu and Zhang (2017) suggest using the median of the B bootstrap distributions, which is the implementation used in the sure package (Greenwell et al., 2017).

Surrogate residuals in R

The sure package supports a variety of R packages for fitting cumulative link and other types of models. The supported packages and their corresponding functions are described in Table 2. The sure package currently exports four functions:

• resids: for constructing surrogate residuals;
In the era of "big data", it is becoming more of a challenge not only to build state-of-the-art predictive models, but also to gain an understanding of what is really going on in the data. For example, it is often of interest to know which, if any, of the predictors in a fitted model are relatively influential on the predicted outcome. Some modern algorithms, like random forests and gradient boosted decision trees, have a natural way of quantifying the importance or relative influence of each feature. Other algorithms, like naive Bayes classifiers and support vector machines, are not capable of doing so, and model-free approaches are generally used to measure each predictor's importance. In this paper, we propose a standardized, model-based approach to measuring predictor importance across the growing spectrum of supervised learning algorithms. Our proposed method is illustrated through both simulated and real data examples. The R code to reproduce all of the figures in this paper is available in the supplementary materials.
Firewalls, especially at large organizations, process high-velocity internet traffic and flag suspicious events and activities. Flagged events can be benign, such as misconfigured routers, or malignant, such as a hacker trying to gain access to a specific computer. Compounding this, flagged events are not always obvious in their danger, and the problem is high-velocity in nature. Current work in firewall log analysis is manually intensive and requires many analyst hours to find events worth investigating. This is predominantly achieved by manually sorting firewall and intrusion detection/prevention system log data. This work aims to improve the ability of analysts to find events for cyber forensics analysis. A tabulated vector approach is proposed to create meaningful state vectors from time-oriented blocks. Multivariate and graphical analysis is then used to analyze the state vectors in a human-machine collaborative interface. Statistical tools, such as the Mahalanobis distance, factor analysis, and histogram matrices, are employed for outlier detection. This research also introduces the breakdown distance heuristic as a decomposition of the Mahalanobis distance, indicating which variables contributed most to its value. This work further explores the application of the tabulated vector approach methodology to collected firewall logs. Lastly, the analytic methodologies employed are integrated into embedded analytic tools so that cyber analysts on the front line can efficiently deploy the anomaly detection capabilities.
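The idea of decomposing the Mahalanobis distance into per-variable contributions can be illustrated with a simplified sketch. Assuming standardized, uncorrelated features (a diagonal covariance matrix), the squared Mahalanobis distance reduces to a sum of squared z-scores, so each variable's contribution is simply its own term. This is an illustrative simplification, not the paper's exact breakdown heuristic, and `breakdown_distance` is a hypothetical helper name.

```python
def breakdown_distance(x, mean, std):
    """Per-variable contribution to a simplified Mahalanobis distance.
    Under a diagonal covariance (uncorrelated features), the squared
    distance is the sum of squared z-scores; the 'breakdown' is each
    variable's squared z-score, revealing which feature drove the outlier."""
    contrib = [((xi - m) / s) ** 2 for xi, m, s in zip(x, mean, std)]
    return sum(contrib) ** 0.5, contrib

# A state vector whose first component is 9 standard deviations from the mean:
d, parts = breakdown_distance([10.0, 2.0, 3.0], [1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
# d == 9.0; parts == [81.0, 0.0, 0.0], so the first variable explains the flag
```

In the full (non-diagonal) case the decomposition must account for correlations between variables, which is precisely where a heuristic like the one proposed in the paper becomes useful to an analyst triaging flagged state vectors.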