In the large cohorts that have been used for genome-wide association studies (GWAS), it is prohibitively expensive to sequence all cohort members. A cost-effective strategy is to sequence subjects with extreme values of quantitative traits or those with specific diseases. By imputing the sequencing data from the GWAS data for the cohort members who are not selected for sequencing, one can dramatically increase the number of subjects with information on rare variants. However, ignoring the uncertainties of imputed rare variants in downstream association analysis will inflate the type I error when sequenced subjects are not a random subset of the GWAS subjects. In this article, we provide a valid and efficient approach to combining observed and imputed data on rare variants. We consider commonly used gene-level association tests, all of which are constructed from the score statistic for assessing the effects of individual variants on the trait of interest. We show that the score statistic based on the observed genotypes for sequenced subjects and the imputed genotypes for nonsequenced subjects is unbiased. We derive a robust variance estimator that reflects the true variability of the score statistic regardless of the sampling scheme and imputation quality, such that the corresponding association tests always have correct type I error. We demonstrate through extensive simulation studies that the proposed tests are substantially more powerful than the use of accurately imputed variants only and the use of sequencing data alone. We provide an application to the Women's Health Initiative. The relevant software is freely available.

data integration | gene-level association tests | genotype imputation | linkage disequilibrium | whole-exome sequencing

Recent technological advances have made it possible to conduct high-throughput DNA sequencing studies on rare variants, which have a stronger impact on complex diseases and traits than common variants (1).
However, it is still economically infeasible to sequence all subjects in a large cohort, and, therefore, only a subset of cohort members can be selected for sequencing. A cost-effective sampling strategy is to preferentially select subjects in the extremes of a quantitative trait distribution or those with a specific disease (2, 3). For case−control studies, an equal number of cases and controls provides more power than other case−control ratios. For quantitative traits, the power increases as more extreme values are sampled (2).

Trait-dependent sampling has been adopted in many sequencing studies, including the National Heart, Lung, and Blood Institute (NHLBI) Exome Sequencing Project (ESP) and the Cohorts for Heart and Aging Research in Genomic Epidemiology (CHARGE) resequencing project. The NHLBI ESP consists of three studies that sequenced subjects with the largest and smallest values of body mass index (BMI), low-density lipoprotein, and blood pressure; one case−control study on myocardial infarction; and one case-only study on stroke (2). The CHARGE resequencing pr...