Plants have significantly more transcription factor (TF) families than animals and fungi, and plant TF families tend to contain more genes; these expansions are linked to adaptation to environmental stressors (1, 2). Many TF family members bind to similar or identical sequence motifs, such as G-boxes (CACGTG), so it is difficult to predict regulatory relationships. We show that the flanking sequences near G-boxes help determine in vitro specificity, but that this is insufficient to predict the transcription pattern of genes near G-boxes. Therefore, we construct a gene regulatory network that identifies the set of bZIPs and bHLHs that are most predictive of the expression of genes downstream of perfect G-boxes. This network accurately predicts transcriptional patterns and reconstructs known regulatory subnetworks. Finally, we present Ara-BOX-cis (araboxcis.org), a website that provides interactive visualisations of the G-box regulatory network, a useful resource for generating predictions for gene regulatory relations.
INTRODUCTION
Many transcription factors (TFs) are part of large families, with many members binding to highly overlapping sets of binding sites. Therefore, any change in a TF's concentration or its spatial or temporal distribution may result in unexpected cross-talk within the gene regulatory network: a TF may inadvertently affect the expression of genes targeted by other members of its family. This cross-talk phenomenon within TF families appears universal within the eukaryotes, and has been described in yeast (3), plants (4) and mammalian cancer cell lines (5). Understanding the mechanisms that govern how TFs within large TF families regulate their target genes is therefore an important challenge.
Understanding gene regulatory networks in plants is further complicated by the fact that plants have more and larger TF families than animals or fungi (1), and even larger families than would be expected through whole genome duplication alone (1, 6). For instance, the highly conserved G-box motif (CACGTG) is bound by TFs in the basic-helix-loop-helix (bHLH) and basic leucine zipper (bZIP) families in organisms ranging from yeasts to humans. However, in plants these two TF families have massively expanded. For example, the bHLH family is now the second largest TF family in plants, with over 100 members in Arabidopsis, despite having arisen from an estimated 14 founder genes in ancient land plants (7). At least 80 of these bHLHs have the precise amino acid composition in their DNA-binding domain required to bind to G-box elements (7, 8). Many of the other bHLHs may bind to E-box elements, which retain four nucleotides of the G-box core (ACGT or CANNTG). The bZIP family has similarly expanded from 4 founder genes to over 70 (9).
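To make the motif definitions above concrete, the following is a minimal illustrative sketch, not the method used in this study, of how perfect G-boxes (CACGTG) and the broader E-box pattern (CANNTG) could be located in a DNA sequence together with their flanking nucleotides. The function name `find_motifs`, the `flank` parameter and the example sequence are hypothetical. Because both patterns are palindromic (each equals its own reverse complement), scanning a single strand is sufficient.

```python
import re

# Perfect G-box core, as defined in the text.
GBOX = "CACGTG"
# E-box pattern CANNTG; the lookahead lets overlapping sites be reported.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_motifs(seq, flank=2):
    """Return E-box hits in `seq`, flagging those that are perfect G-boxes.

    `flank` (a hypothetical parameter) is the number of nucleotides
    reported on each side of the 6-bp core.
    """
    seq = seq.upper()
    hits = []
    for match in EBOX.finditer(seq):
        start = match.start()
        core = match.group(1)
        hits.append({
            "pos": start,                               # 0-based start of the core
            "core": core,
            "is_gbox": core == GBOX,                    # perfect G-box or other E-box
            "left": seq[max(0, start - flank):start],   # upstream flank
            "right": seq[start + 6:start + 6 + flank],  # downstream flank
        })
    return hits


if __name__ == "__main__":
    # Made-up sequence containing one G-box and one non-G-box E-box.
    for hit in find_motifs("TTGCCACGTGGCAAACAGCTGTT"):
        print(hit)
```

Every perfect G-box is also an E-box under this definition, so the `is_gbox` flag is what separates the two classes. The flanks are returned alongside each hit because sequence context around the core is what distinguishes otherwise identical sites.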
Moreover, both bHLHs and bZIPs bind to DNA as either homodimers or heterodimers, further increasing the possible regulatory combinations. Even non-G-box binding bHLHs and other HLHs can ...