The identification of potential regulatory motifs in new sequence data is increasingly important for experimental design. Those motifs are commonly located by matches to IUPAC strings derived from consensus sequences. Although this method is simple and widely used, a major drawback of IUPAC strings is that they necessarily remove much of the information originally present in the set of sequences. Nucleotide distribution matrices retain most of the information and are thus better suited to evaluate new potential sites. However, sufficiently large libraries of pre-compiled matrices are a prerequisite for practical application of any matrix-based approach and are just beginning to emerge. Here we present a set of tools for molecular biologists that allows generation of new matrices and detection of potential sequence matches by automatic searches with a library of pre-compiled matrices. We also supply a large library (> 200) of transcription factor binding site matrices that has been compiled on the basis of published matrices as well as entries from the TRANSFAC database, with emphasis on sequences with experimentally verified binding capacity. Our search method includes position weighting of the matrices based on the information content of individual positions and calculates a relative matrix similarity. We show several examples suggesting that this matrix similarity is useful in estimating the functional potential of matrix matches and thus provides a valuable basis for designing appropriate experiments.
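The scoring idea described above (weight each matrix position by its information content, then express a candidate site's score relative to the best score the matrix can reach) can be illustrated with a minimal sketch. This is not the published implementation: the Shannon information measure in bits, the function names `information_content`, `matrix_similarity` and `scan`, the threshold value, and the toy count matrix are all assumptions made for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical 4-position count matrix (one dict of base counts per position).
TOY_MATRIX = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 3, "C": 2, "G": 3, "T": 2},
    {"A": 0, "C": 0, "G": 0, "T": 10},
]


def information_content(column):
    """Shannon information of one column in bits (0 = uniform, 2 = fully conserved).

    Assumption: a plain four-letter alphabet without gaps or pseudocounts.
    """
    total = sum(column[b] for b in BASES)
    bits = 2.0
    for b in BASES:
        p = column[b] / total
        if p > 0:
            bits += p * math.log2(p)
    return bits


def matrix_similarity(matrix, site):
    """Weighted score of `site`, relative to the best score the matrix allows (0..1)."""
    if len(site) != len(matrix):
        raise ValueError("site and matrix must have the same length")
    observed = 0.0
    best = 0.0
    for column, base in zip(matrix, site.upper()):
        total = sum(column[b] for b in BASES)
        weight = information_content(column)
        # Bases outside ACGT (for example N) contribute nothing to the score.
        observed += weight * column.get(base, 0) / total
        best += weight * max(column[b] for b in BASES) / total
    return observed / best if best > 0 else 0.0


def scan(matrix, sequence, threshold=0.85):
    """Slide the matrix along `sequence`; return (start, window, score) for each hit."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = matrix_similarity(matrix, window)
        if score >= threshold:
            hits.append((start, window, score))
    return hits


if __name__ == "__main__":
    for start, window, score in scan(TOY_MATRIX, "CCAGATCC"):
        print(start, window, round(score, 3))
```

In this sketch a mismatch at a highly conserved position lowers the similarity far more than a mismatch at the nearly uninformative third position, which is the behaviour that position weighting is meant to provide.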
The publication of the first almost complete sequence of a human chromosome (chromosome 22) is a major milestone in human genomics. Together with the sequence, an excellent annotation of genes was published which certainly will serve as an information resource for numerous future projects. We noted that the annotation did not cover regulatory regions; in particular, no promoter annotation has been provided. Here we present an analysis of the complete published chromosome 22 sequence for promoters. A recent breakthrough in specific in silico prediction of promoter regions enabled us to attempt large-scale prediction of promoter regions on chromosome 22. Scanning of sequence databases revealed only 20 experimentally verified promoters, of which 10 were correctly predicted by our approach. Nearly 40% of our 465 predicted promoter regions are supported by the currently available gene annotation. Promoter finding also provides a biologically meaningful method for "chromosomal scaffolding", by which long genomic sequences can be divided into segments starting with a gene. As one example, the combination of promoter region prediction with exon/intron structure predictions greatly enhances the specificity of de novo gene finding. The present study demonstrates that it is possible to identify promoters in silico on the chromosomal level with sufficient reliability for experimental planning and indicates that a wealth of information about regulatory regions can be extracted from current large-scale (megabase) sequencing projects. Results are available on-line at http://genomatix.gsf.de/chr22/.
The human genome sequencing project completed the first major milestone with the publication of most of the euchromatic part of human chromosome 22 (Dunham et al. 1999). The consortium identified a total of 545 genes using a careful approach, relying primarily on the mapping of experimental data such as cDNAs and EST clusters.
In silico predictions were used to identify genomic data such as CpG islands and repetitive sequence contents.
The promoter of a gene is generally located in its 5′ region and contains vital information about gene expression and regulatory networks, including gene targets of individual transcriptional cascades/signaling pathways. However, cDNAs and EST clusters are often 5′ incomplete and thus do not provide reliable information about promoters. This and the scarcity of experimental data regarding promoters are probably the major reasons why no corresponding annotation for promoters was attempted.
It has not been possible thus far to predict polymerase II promoters in silico with sufficient specificity in the context of large genomic sequences. This problem was highlighted by the publication of the GASP project (Reese et al. 2000). We recently developed a new method called PromoterInspector (Scherf et al. 2000) to locate genomic regions of about 0.2 kb to 2 kb which contain or overlap with polymerase II promoters. We showed that PromoterInspector is capable of predicting promoter regions in sequences over 1 Mb in...