Japan

ABSTRACT

Genome-wide association studies (GWAS) have identified over 150,000 links between common genetic variants and human traits or complex diseases. Over 80% of these associations map to polymorphisms in non-coding DNA. Therefore, the challenge is to identify disease-causing variants, the genes they affect, and the cells in which these effects occur. We have developed a platform using ATAC-seq, DNaseI footprints, NG Capture-C and machine learning to address this challenge. Applying this approach to red blood cell traits identifies a significant proportion of known causative variants and their effector genes, which we show can be validated by direct in vivo modelling.

Identifying the genomic variation that determines the risk of common chronic and infectious diseases informs on their primary causes, which leads to preventative or therapeutic approaches and insights. Whilst genome-wide association studies (GWASs) have identified thousands of chromosome regions [1], the identification of the causal genes, variants and cell types remains a major bottleneck. This is due to three major features of the genome and its complex association with disease susceptibility. First, trait-associated variants are often tightly associated, through linkage disequilibrium (LD), with tens or hundreds of other variants, mostly single-nucleotide polymorphisms (SNPs), any one or more of which could be causal. Second, the majority (>85%) of the variants identified in GWAS lie within the non-coding genome [2]. Although non-coding regions are increasingly well annotated, many variants do not correspond to known regulatory elements, and even when they do, it is rarely known which genes these elements control, and in which cell types.
New technical approaches to link variants to the genes they control are rapidly improving but are often limited by their sensitivity and resolution [3][4][5][6]; and because so few causal variants have been unequivocally linked to the genes they affect, the mechanisms by which non-coding variants alter gene expression remain unknown in all but a few cases. Third, the complexity of gene regulation and cell-cell interactions means that it is usually impossible to predict when in development, in which cell type, in which activation state, and within which pathway(s) a causal variant exerts its effect. Although significant progress is being made, none of these problems has yet been adequately solved.
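The linkage disequilibrium problem described above can be made concrete with a small calculation. The following is a minimal sketch, not part of the platform described here, of the standard r² statistic between two biallelic sites computed from phased haplotypes. The function name `ld_r_squared` and the toy haplotypes are illustrative assumptions; real analyses would use genotype data from a reference panel.

```python
from typing import Sequence, Tuple


def ld_r_squared(haplotypes: Sequence[Tuple[int, int]]) -> float:
    """Return r^2 between two biallelic sites.

    Each haplotype is (allele_at_site_A, allele_at_site_B), coded 0 or 1.
    Hypothetical helper for illustration only.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at site A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at site B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # LD coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both sites must be polymorphic")
    return d * d / denom


# Toy haplotypes (invented): two sites that are always inherited together,
# and two sites whose alleles combine independently.
perfect = [(1, 1)] * 4 + [(0, 0)] * 6
independent = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r_squared(perfect))
print(ld_r_squared(independent))
```

In the first toy set the two sites are statistically indistinguishable (r² of 1, up to floating-point rounding), so association data alone cannot say which of them is causal; in the second set r² is 0 and the sites can be separated.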
Here, we have developed an integrated platform of experimental and computational methods to prioritise likely causal variants, link them to the genes they regulate, and determine the mechanism by which they alter gene function. To illustrate the approach we have initially focussed on a single haematopoietic lineage: the development of mature red blood cells (RBC), for which all stages of lineage specification and differentiation from a haematopoietic stem cell to a RBC are known, and can be r...