Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark, which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes, and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster, and in some situations computation was reduced to a hundredth of the time.
SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.

Massively parallel sequencing technologies are generating an unprecedented amount of sequence data on various kinds of samples including human exomes and genomes. Many rare variant association methods have been developed to elucidate the underlying disease etiology using large-scale population-based sequence datasets. 1-5 Although some findings are promising, 6 statistical power analyses performed with simulated data demonstrate that large sample sizes of tens or even hundreds of thousands of individuals are required for adequately powered studies. 7,8 Large-scale genetic epidemiological studies are currently ongoing, including the Trans-Omics for Precision Medicine program (TopMed) (see Web Resources) and UK BioBank 9 studies. Additional large-scale genetic epidemiological studies are emerging that will generate whole-genome sequence (WGS) data or impute WGS data into existing genotype array data to better understand the genetic etiology of complex traits.

It is problematic to analyze large datasets of massively parallel sequence data given the limitations of current analytic tools for annotation, data quality control, and association testing. 9,10 Analytic tools such as PLINK/SEQ and ...