2018
DOI: 10.29115/sp-2018-0003
Surveying the Forests and Sampling the Trees: An Overview of Classification and Regression Trees and Random Forests with Applications in Survey Research

Cited by 24 publications (22 citation statements)
References 14 publications
“…5 We use a supervised learning algorithm (random forests) to perform variable selection allowing for all possible interactions between the covariates in equation 1. Random forests are a nonparametric ensemble learning technique for regression that captures complex interactions and nonlinear structures in the data by using multiple decision trees grown from independent bootstrapped samples of the training data (see Breiman 2001; Buskirk 2018; Buskirk et al. 2018). The algorithm grows an independent decision tree (a weak learner) from each bootstrapped sample of the training data and then combines all the weak learners into a single strong learner by averaging across them.…”
Section: Methods
confidence: 99%
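A minimal sketch of the bagging mechanism this statement describes: one decision tree is grown per bootstrap sample, and the forest averages the weak learners' predictions into a single strong learner. The synthetic data, the tree count, and the use of scikit-learn's DecisionTreeRegressor are illustrative assumptions, not details taken from the cited papers.

```python
# Sketch of the bagging idea behind random forests: grow one decision tree
# per bootstrap sample, then average the trees' predictions.
# The data and all parameter values below are illustrative assumptions.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 4))
y = np.sin(X[:, 0]) + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

n_trees = 100
trees = []
for _ in range(n_trees):
    idx = rng.integers(0, len(X), size=len(X))         # bootstrap sample (with replacement)
    tree = DecisionTreeRegressor(max_features="sqrt")  # random feature subset at each split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# The strong learner: average the weak learners' predictions.
y_hat = np.mean([t.predict(X) for t in trees], axis=0)
```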
“…This technique allows us to identify complex patterns in the data that could not be identified using conventional empirical methods (e.g., ordinary least squares [OLS] or logistic regression), and it provides valuable information about the performance of each covariate (Hastie et al. 2009). The main advantages of using random forests as a variable selection technique are that they reduce over-fitting, by aggregating over multiple trees, and reduce bias, when trees are grown deep enough (see Breiman 2001; Buskirk 2018; Hastie et al. 2009). The analysis is divided into two steps.…”
Section: Methods
confidence: 99%
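The statement above uses random forests for variable selection. A minimal sketch of one common way to do this is to rank covariates by the forest's impurity-based feature importances and keep those above a cutoff; the data, the 500-tree forest, and the mean-importance cutoff are assumptions for illustration, not the cited papers' exact procedure.

```python
# Sketch of random-forest-based variable selection via impurity-based
# feature importances. Data, forest size, and cutoff are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 10))
y = 2 * X[:, 0] + X[:, 1] * X[:, 2] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(X, y)

# Keep covariates whose importance exceeds the mean importance
# (a common, if rough, selection rule; the cutoff is an assumption).
importances = rf.feature_importances_
selected = np.flatnonzero(importances > importances.mean())
print("selected covariates:", selected)
```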
“…[Advantages:] 1. Can handle outliers and missing data [89]; 2. Computationally fast [90]. [Disadvantages:] Models are based on splits that depend on previous splits; an error made in a higher split will propagate down [90]. Users need to pre-specify dependent (or target) variables. Abbreviations: CHAID, Chi-square Automatic Interaction Detector; CART, Classification and Regression Tree. # Some studies applied multiple methods in tandem or in combination.…”
Section: Results
confidence: 99%
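The "error propagates down" disadvantage follows from the hierarchical structure of CART-style trees: every split is conditioned on all the splits above it. A small sketch that prints a fitted tree's rules makes that dependence visible; the dataset and tree depth are illustrative assumptions.

```python
# Illustration of the hierarchical nature of CART-style trees: each leaf's
# rule is the conjunction of every ancestor split on the path from the
# root, so a poor split near the top affects all nodes beneath it.
# The dataset below is a synthetic assumption.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = make_classification(n_samples=200, n_features=4, random_state=3)
tree = DecisionTreeClassifier(max_depth=3, random_state=3).fit(X, y)

# Print the nested splitting rules of the fitted tree.
print(export_text(tree, feature_names=[f"x{i}" for i in range(4)]))
```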
“…We estimate the tuning parameter using 10-fold cross-validation of the training data along with a one-standard-error rule, and run a random forest to predict the outcome using 100 classification trees (Buskirk 2018).…”
Section: Imprf
confidence: 99%
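A sketch of the tuning procedure the statement outlines: score each candidate value by 10-fold cross-validation, apply a one-standard-error rule, and refit a 100-tree random forest at the chosen value. The tuned parameter (max_features), the candidate grid, and the synthetic data are assumptions for illustration; the statement does not specify them.

```python
# Sketch: tune a random forest classifier by 10-fold cross-validation with
# a one-standard-error rule, then refit with 100 trees. The tuned
# parameter (max_features), grid, and data are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=2)

grid = [2, 4, 6, 8, 10]  # candidate values for max_features (assumed)
means, ses = [], []
for m in grid:
    scores = cross_val_score(
        RandomForestClassifier(n_estimators=100, max_features=m, random_state=2),
        X, y, cv=10)
    means.append(scores.mean())
    ses.append(scores.std(ddof=1) / np.sqrt(len(scores)))

# One-standard-error rule: among candidates whose mean score is within one
# standard error of the best, prefer the most parsimonious (here taken to
# be the smallest max_features, itself an interpretive assumption).
best = int(np.argmax(means))
threshold = means[best] - ses[best]
chosen = min(m for m, mu in zip(grid, means) if mu >= threshold)

final_rf = RandomForestClassifier(n_estimators=100, max_features=chosen,
                                  random_state=2).fit(X, y)
```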