Genome-wide and phenome-wide association studies are commonly used to identify important relationships between genetic variants and phenotypes. Most of these studies have treated diseases as independent variables and suffered from heavy multiple-testing adjustment burdens due to the large number of genetic variants and disease phenotypes. In this study, we propose using topic modeling via non-negative matrix factorization (NMF) to identify associations between disease phenotypes and genetic variants. Topic modeling is an unsupervised machine learning approach that can be used to learn semantic patterns from electronic health record data. We chose rs10455872 in LPA as the predictor because it has been shown to be associated with increased risk of hyperlipidemia and cardiovascular diseases (CVD). Using data from 12,759 individuals in the biobank at Vanderbilt University Medical Center, we trained a topic model using NMF on 1,853 distinct phecodes extracted from the cohort's electronic health records and generated six topics. We quantified their associations with rs10455872 in LPA. Topics indicating CVD had positive correlations with rs10455872 (P < 0.001), replicating a previous finding. We also identified a negative correlation between the LPA variant and a topic representing lung cancer (P < 0.001). Our results demonstrate the applicability of topic modeling in exploring the relationship between the genome and clinical diseases.

Author summary
Identifying the clinical associations of genetic variants remains crucial in understanding how the human genome modulates disease risk. Traditional phenome-wide association studies consider each disease phenotype as an independent variable; however, diseases often present as complex clusters of comorbid conditions.
In this study, we propose using topic modeling to model electronic health record data as a mixture of topics (e.g., disease clusters or relevant comorbidities) and testing associations between topics and genetic variants. Our results demonstrated the feasibility of using topic modeling to replicate known associations and discover novel ones between the human genome and clinical diseases.

Introduction
Elucidating associations between genetic variants and human diseases creates new avenues for disease prevention and enables more precise treatment of diseases [1,2]. During the past two decades, genetic studies have uncovered thousands of genetic variants that influence risk for disease phenotypes [3]. One example is the discovery of a variant in proprotein convertase subtilisin/kexin type 9 (PCSK9) [4] associated with low plasma low-density lipoprotein, which led to a new therapeutic drug class that was approved by the US Food and Drug Administration in 2015. Many of these discoveries come from large-scale association analyses. The two most notable approaches are genome-wide (GWAS) and phenome-wide association studies (PheWAS) [2,5]. For a given phenotype, GWAS scans hundreds of thou...