Transcription regulation is controlled by coordinated binding of one or more transcription factors in the promoter regions of genes. In many species, especially higher eukaryotes, transcription factor binding sites tend to occur as homotypic or heterotypic clusters, also known as cis-regulatory modules. The number of sites and distances between the sites, however, vary greatly in a module. We propose a statistical model to describe the underlying cluster structure as well as individual motif conservation and develop a Monte Carlo motif screening strategy for predicting novel regulatory modules in upstream sequences of coregulated genes. We demonstrate the power of the method with examples ranging from bacterial to insect and human genomes.

evolutionary Monte Carlo | gene regulation | hidden Markov models | transcription factor binding sites

Transcription factor binding sites (TFBSs) are short sequence segments (≈10 bp) located near genes' transcription start sites (TSSs) and are recognized by respective transcription factors (TFs) for gene regulation. Laboratory assays such as electrophoretic mobility shift assays and DNase footprinting have been developed to locate TFBSs on a gene-by-gene and site-by-site basis, but these methods are laborious, time-consuming, and unsuitable for large-scale studies. Computational methods thus have become necessary for genome-wide analyses of transcription regulation.

TFBSs recognized by the same TF usually show a conserved pattern, which is often called a TF binding motif (TFBM) and modeled by a position-specific weight matrix (PSWM) with each of its columns describing the occurrence frequencies of the four nucleotides in the corresponding motif position. Over the past decade, a spate of computational methods have been developed to infer TFBMs for sets of coregulated genes (1-9).
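To make the PSWM representation concrete, the following minimal Python sketch estimates column-wise nucleotide frequencies from a set of aligned sites and scores sequence windows by their log-odds against a background distribution. It is an illustration only, not the model proposed in this paper: the function names (`build_pswm`, `log_odds`, `best_hit`), the pseudocount of 1, the uniform background, and the toy sites are all assumptions introduced here.

```python
import math

NUCLEOTIDES = "ACGT"


def build_pswm(sites, pseudocount=1.0):
    """Estimate a PSWM (one frequency column per motif position)
    from equal-length aligned binding sites.

    The pseudocount is an illustrative assumption that keeps every
    frequency strictly positive so that log-odds scores stay finite.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    pswm = []
    for j in range(width):
        counts = {base: pseudocount for base in NUCLEOTIDES}
        for site in sites:
            counts[site[j]] += 1
        total = sum(counts.values())
        pswm.append({base: counts[base] / total for base in NUCLEOTIDES})
    return pswm


def log_odds(segment, pswm, background):
    """Log-likelihood ratio of a segment under the motif vs. the background."""
    return sum(
        math.log(column[base] / background[base])
        for column, base in zip(pswm, segment)
    )


def best_hit(sequence, pswm, background):
    """Return (score, start) of the highest-scoring window in the sequence."""
    width = len(pswm)
    return max(
        (log_odds(sequence[i:i + width], pswm, background), i)
        for i in range(len(sequence) - width + 1)
    )


# Hypothetical toy data, not taken from any real TF.
toy_sites = ["TATAAT", "TATGAT", "TACAAT", "TATACT"]
uniform_background = {base: 0.25 for base in NUCLEOTIDES}

toy_pswm = build_pswm(toy_sites)
score, start = best_hit("GGCGTATAATGCC", toy_pswm, uniform_background)
print(f"best window starts at {start} with log-odds {score:.3f}")
```

A single-matrix scan of this kind treats each site independently and ignores the spacing and co-occurrence of neighboring sites.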
There also have been efforts to improve motif prediction by correlating sequence patterns with mRNA expression data (10, 11) or by using comparative genomics information (12-14). Although these methods have been very successful for bacterial and yeast genomes, they have met with limited success in mammalian genomes.

The main difficulties with in silico TFBM predictions in higher eukaryotes include the increased volume of the sequence search space, with proximal TFBSs occurring a few kilobases away from the TSSs; the increased occurrence of low-complexity repeats; the increased complexity in combinatorial controls; and shorter and less-conserved TFBSs. Despite these challenges, there are two possible redeeming factors: (i) many eukaryotic genomes have been or are being sequenced, and comparative genomic analysis can be extremely powerful; and (ii) most eukaryotic genes are controlled by a combination of factors with the corresponding binding sites forming homotypic or heterotypic clusters known as "cis-regulatory modules" (CRMs) (15, 16). A statistical model that can explicitly incorporate the CRM concept is likely to bring out more information.

Most available approaches for discovering CRMs have concen...