Summary

This package contains functions useful for exploratory data analysis using random forests, which can be fit using the randomForest, randomForestSRC, or party packages (Liaw and Wiener 2002; Ishwaran and Kogalur 2013; Hothorn, Hornik, and Zeileis 2006). These functions can compute the partial dependence of covariates (individually or in combination) on the fitted forests' predictions, the permutation importance of covariates, as well as the distance between data points according to the fitted model.

Random forests are an attractive method for social scientists. Random forests have only a few important tuning parameters and can be adapted to do classification, regression, and clustering. Many research tasks require interpretable models, and, although methods for interpreting random forests exist, in the most popular packages for fitting random forests these methods are available inconsistently. For example, although there are methods for computing permutation importance in the randomForest, randomForestSRC, and party packages, only randomForest can compute local importance. None of the packages can compute permutation importance for groups of covariates. Similarly, partial dependence is only implemented in randomForest, and it has limited functionality compared to the functions provided herein. This software has been used in Jones and Linder (2015) and Jones and Lupu (2016).

Partial dependence, as described by Friedman (2001), estimates the marginal relationship between a subset of the covariates and the model's predictions by averaging over the marginal distribution of the complement of this subset of the covariates. This approximation allows the display of the relationship between this subset of the covariates and the model's predictions even when there are many covariates which may interact. This functionality works with models fit by any of the aforementioned packages to any of the supported types of outcome variables.
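The averaging that defines partial dependence can be illustrated with a minimal Python sketch. This is not the package's implementation or interface: the function name `partial_dependence_sketch`, the stand-in `toy_model`, and the simulated data are all assumptions made for illustration, and any function that maps rows of covariates to predictions could take the place of a fitted forest.

```python
import numpy as np

def partial_dependence_sketch(predict, X, feature, grid):
    """Average prediction with column `feature` fixed at each value in `grid`.

    Hypothetical helper for illustration; not the package's partial_dependence.
    """
    X = np.asarray(X, dtype=float)
    averages = []
    for value in grid:
        X_fixed = X.copy()
        X_fixed[:, feature] = value  # hold the covariate of interest fixed
        # Averaging over the rows averages over the observed values of the
        # remaining covariates (the complement of the chosen subset).
        averages.append(predict(X_fixed).mean())
    return np.array(averages)

# Assumed stand-in for a fitted forest, with an interaction between covariates.
def toy_model(X):
    return 2.0 * X[:, 0] + X[:, 0] * X[:, 1]

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
grid = np.array([-1.0, 0.0, 1.0])
pd_values = partial_dependence_sketch(toy_model, X, feature=0, grid=grid)
```

For this toy model the result at each grid value v equals 2v plus v times the sample mean of the second covariate, which shows how the interaction is averaged out rather than ignored.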
partial_dependence can be parallelized and also contains a number of additional parameters which allow the user to control this approximation. There is an associated plot function plot_pd which constructs plots for a wide variety of possible outputs from partial_dependence (e.g., when pairs of covariates are considered jointly, when each covariate is considered separately, when the outcome variable is categorical, etc.).

Permutation importance estimates the importance of a covariate by randomly shuffling its values, breaking any dependence between said covariate and the outcome, and then computing the difference between the predictions made by the model with that covariate shuffled and the predictions made when the covariate was not shuffled. If the covariate was useful in generating predictions then the prediction errors will increase in expectation when the covariate is shuffled, whereas no such increase can be expected when the covariate has no influence. Although all three of the random forest packages provide at least one method of assessing variable importance, variable_importance provides a c...
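The shuffling procedure behind permutation importance can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the package's variable_importance function: the name `permutation_importance_sketch`, the use of mean squared error as the measure of prediction error, the number of permutations, and the toy model and data are all choices made here for illustration.

```python
import numpy as np

def permutation_importance_sketch(predict, X, y, feature, n_permutations=20, seed=0):
    """Mean increase in squared error after shuffling column `feature`.

    Hypothetical helper; squared error is an assumed loss, not the package's.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    baseline = np.mean((y - predict(X)) ** 2)  # error with the data intact
    increases = []
    for _ in range(n_permutations):
        X_shuffled = X.copy()
        # Shuffling breaks any dependence between this covariate and the outcome.
        X_shuffled[:, feature] = rng.permutation(X_shuffled[:, feature])
        shuffled_error = np.mean((y - predict(X_shuffled)) ** 2)
        increases.append(shuffled_error - baseline)
    return float(np.mean(increases))

# Assumed stand-in for a fitted forest: uses column 0 and ignores column 1.
def toy_model(X):
    return 3.0 * X[:, 0]

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
y = toy_model(X) + rng.normal(scale=0.1, size=300)
importance_used = permutation_importance_sketch(toy_model, X, y, feature=0)
importance_unused = permutation_importance_sketch(toy_model, X, y, feature=1)
```

Shuffling the covariate the toy model relies on raises the error substantially, while shuffling the ignored covariate leaves the predictions, and therefore the error, unchanged.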
Despite its rich tradition, there are key limitations to researchers' ability to make generalizable inferences about state policy innovation and diffusion. This paper introduces new data and methods to move from empirical analyses of single policies to the analysis of comprehensive populations of policies and rigorously inferred diffusion networks. We have gathered policy adoption data appropriate for estimating policy innovativeness and tracing diffusion ties in a targeted manner (e.g., by policy domain, time period, or policy type) and extended the development of methods necessary to accurately and efficiently infer those ties. Our state policy innovation and diffusion (SPID) database includes 728 different policies coded by topic area. We provide an overview of this new dataset and illustrate two key uses: (i) static and dynamic innovativeness measures and (ii) latent diffusion networks that capture common pathways of diffusion between states across policies. The scope of the data allows us to compare patterns in both across policy topic areas. We conclude that these new resources will enable researchers to empirically investigate classes of questions that were difficult or impossible to study previously, but whose roots go back to the origins of the political science policy innovation and diffusion literature.
Supervised machine learning methods are increasingly employed in political science. Such models require costly manual labeling of documents. In this paper, we introduce active learning, a framework in which data to be labeled by human coders are not chosen at random but rather targeted in such a way that the required amount of data to train a machine learning model can be minimized. We study the benefits of active learning using text data examples. We perform simulation studies that illustrate conditions where active learning can reduce the cost of labeling text data. We perform these simulations on three corpora that vary in size, document length, and domain. We find that in cases where the document class of interest is not balanced, researchers can label a fraction of the documents one would need using random sampling (or “passive” learning) to achieve equally performing classifiers. We further investigate how varying levels of intercoder reliability affect the active learning procedures and find that even with low reliability, active learning performs more efficiently than does random sampling.
The identification of substantively similar policy proposals in legislation is important to scholars of public policy and legislative politics. Manual approaches are prohibitively costly in constructing datasets that accurately represent policymaking across policy domains, jurisdictions, or time. We propose the use of an algorithm that identifies similar sequences of text (i.e., text reuse), applied to legislative text, to measure the similarity of the policy proposals advanced by two bills. We study bills from U.S. state legislatures. We present three ground truth tests, applied to a corpus of 500,000 bills: we show that bills introduced by ideologically similar sponsors exhibit a high degree of text reuse, that bills classified by the National Conference of State Legislatures as covering the same policies exhibit a high degree of text reuse, and that rates of text reuse between states correlate with policy diffusion network ties between states. In an empirical application of our similarity measure, we find that Republican state legislators introduce legislation that is more similar to legislation introduced by Republicans in other states than is legislation introduced by Democratic state legislators to legislation introduced by Democrats in other states.