Recent pangenome studies have revealed that a large fraction of the gene content within a species exhibits presence-absence variation (PAV). However, coding regions alone provide an incomplete assessment of functional genomic sequence variation at the species level. Little to no attention has been paid to noncoding regulatory regions in pangenome studies, even though these sequences directly modulate gene expression and phenotype. To uncover regulatory genetic variation, we generated chromosome-scale genome assemblies for thirty Arabidopsis thaliana accessions from multiple distinct habitats and characterized species-level variation in Conserved Noncoding Sequences (CNS). Our analyses uncovered not only evidence for PAV and positional variation (PosV), but also that diversity in CNS is non-random, with variants shared across different accessions. Using evolutionary analyses and chromatin accessibility data, we provide further evidence supporting roles for conserved and variable CNS in gene regulation. Characterizing species-level diversity in all functional genomic sequences may later uncover previously unknown mechanistic links between genotype and phenotype.
Introduction: Conserved noncoding DNA remains a highly understudied class of functional genomic features compared to protein-coding genes. Previous comparative genomic analyses in plants have identified stretches of noncoding sequence, generally 15-150 base pairs (bp) long (Fig. S1), that are positionally conserved with identical (or near-identical) sequence across distantly related species (1-4). These sequences, commonly referred to as Conserved Noncoding Sequences (CNS), are regions in the genome displaying much higher similarity across different taxa than expected by chance. Background mutation and genetic drift purge non-functional sequences over long evolutionary distances. Therefore, sequence conservation above expectation implies that purifying selection actively maintains these CNS. Indeed, Williamson et al. (5) discovered elevated signatures of purifying selection in CNS regions compared to other classes of noncoding DNA in Capsella grandiflora. Previous studies demonstrated that CNS contain transcription factor binding sites (TFBS) (2, 6, 7). TFBS are typically 6-12 bp long (8). CNS can exceed this length, as they are thought to consist of arrays of TFBS capable of recruiting independent or cooperative transcriptional protein complexes. The length of CNS enables high-confidence identification of orthologous cis-regulatory elements in comparator genomes. Querying genomes for TFBS alone results in a high false positive rate, as a given 6 bp sequence is expected to occur >30,000 times by chance even in the relatively small (~135 Mb) Arabidopsis thaliana genome (roughly 135 × 10^6 / 4^6 ≈ 33,000 per strand, assuming equal base frequencies). In contrast, there is less than one expected