Data sampling methods have been investigated for decades in the context of machine learning and statistical algorithms, with significant progress made in the past few years driven by strong interest in big data and distributed computing. Most recently, progress has been made in methods that can be broadly categorized into random sampling, including density-biased and nonuniform sampling methods; active learning methods, which are a type of semi-supervised learning and an area of intense research; and progressive sampling methods, which can be viewed as a combination of the above two approaches. A unified view of scaling-down sampling methods is presented in this article and complemented with descriptions of relevant published literature.
In this tutorial, we show how data balancing, in general, and stratified covariate balancing, in particular, can be used to benchmark clinicians. This tutorial aims to explain the concepts behind data balancing to readers who do not have a strong statistical background. Data balancing enables the analyst to compare the performance of clinicians with their peer groups on the same set of patients. The comparison is done in 3 steps. First, the patients are described in terms of their conditions/comorbidities. Each combination of comorbidities is treated as a separate type of patient. Second, the analyst measures the frequency of observing different types of patients. Third, expected outcomes are calculated for both the clinician and the peer group. The expected outcome for the clinician is calculated as the sum, across patient types, of the product of 2 terms: the probability of observing a type of patient and the average outcome for that type of patient. The expected outcome for the peer group is calculated in the same way, with one difference: the distribution of the peer group's patients is replaced with the distribution of the clinician's patients. This allows us to simulate the performance of the peer group on the clinician's patients. This switch in frequencies accomplishes the same goal as using propensity weights, or covariate balancing weights, but it avoids the cumbersome need to estimate the weights. In switching the distributions, a problem arises when the peer group does not see the same types of patients as the clinician. When the peer group's outcome for some patient type is missing, a synthetic case is constructed. These synthetic cases are constructed from the peer group's experience with 2 complementary parts of the missing case. The reliance on synthetic cases allows one to have a match for every type of the clinician's patients. Together, the synthetic cases and the switch of distributions allow one to simulate the performance of the clinician and the peer group on the same set of cases.
The tutorial walks the reader through examples. The procedures described here can be applied to data in electronic health records. We present Structured Query Language (SQL) code for doing so.
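The 3-step comparison described in the abstract can be illustrated with a minimal Python sketch. It is not the tutorial's own SQL code. The function names (`stratum_stats`, `expected_outcomes`), the record layout, and the example data are all hypothetical, and the outcome is assumed to be a single number per patient (for example, length of stay in days). The sketch does not construct synthetic cases, because the abstract does not give enough detail to do so; it only reports the patient types for which the peer group has no match.

```python
from collections import defaultdict


def stratum_stats(records):
    """Return {patient_type: (count, mean_outcome)}.

    Each record is a (comorbidities, outcome) pair. Every distinct
    combination of comorbidities is treated as one patient type (step 1).
    """
    totals = defaultdict(lambda: [0, 0.0])
    for comorbidities, outcome in records:
        key = frozenset(comorbidities)
        totals[key][0] += 1
        totals[key][1] += outcome
    return {key: (n, s / n) for key, (n, s) in totals.items()}


def expected_outcomes(clinician_records, peer_records):
    """Compare clinician and peer group on the clinician's patient mix.

    Returns (clinician_expected, peer_expected, unmatched_types).
    peer_expected is None when some clinician patient type never occurs
    in the peer data; the tutorial handles that situation with synthetic
    cases, which this sketch does not attempt.
    """
    clin = stratum_stats(clinician_records)
    peer = stratum_stats(peer_records)
    n_total = sum(n for n, _ in clin.values())

    clin_expected = 0.0
    peer_expected = 0.0
    unmatched = []
    for ptype, (n, clin_mean) in clin.items():
        p = n / n_total  # step 2: frequency of this type among the clinician's patients
        clin_expected += p * clin_mean  # step 3: probability times average outcome
        if ptype in peer:
            # Peer average outcome weighted by the clinician's distribution
            peer_expected += p * peer[ptype][1]
        else:
            unmatched.append(ptype)

    if unmatched:
        peer_expected = None
    return clin_expected, peer_expected, unmatched


# Hypothetical data: (set of comorbidities, outcome in days)
clinician_records = [
    ({"diabetes"}, 10), ({"diabetes"}, 14),
    ({"diabetes", "chf"}, 20),
    (set(), 5),
]
peer_records = [
    ({"diabetes"}, 10),
    ({"diabetes", "chf"}, 16), ({"diabetes", "chf"}, 18),
    (set(), 4), (set(), 6), (set(), 5),
]

clin, peer, unmatched = expected_outcomes(clinician_records, peer_records)
print(clin, peer, unmatched)
```

In this made-up example, half of the clinician's patients have diabetes only, so the peer group's average for that type receives a weight of 0.5 even though it makes up only one sixth of the peer group's own patients. That reweighting is the "switch of distributions" the abstract refers to.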