Extracting structured motifs using a suffix tree—algorithms and application to promoter consensus identification

Marsan, Laurent; Sagot, Marie‐France

doi:10.1145/332306.332553

Cited by 66 publications

(43 citation statements)

References 19 publications


“…A resemblance exists between this structure and the work related to regulatory motifs [71,66,83,60] and probabilistic suffix trees [78,82,69]. Regulatory motifs characterize short sequences of DNA and determine the timing location and level of gene expression, and the approaches extracting regulatory motifs can be divided into two categories: those that exploit word-counting heuristics [57,69] and those based on the use of probabilistic models [40,48,64,79,85,87]; in the second category of approaches, the motifs are represented by position probabilistic matrices, whereas the remainder of the sequences are represented by background models. The probabilistic or prediction suffix tree is basically a stochastic model that employs a suffix tree as its index structure to represent compactly the conditional probabilities distribution for a cluster of sequences.…”

Section: Index Structures For Weighted Strings

mentioning

confidence: 84%

“…The probabilistic or prediction suffix tree is basically a stochastic model that employs a suffix tree as its index structure to represent compactly the conditional probabilities distribution for a cluster of sequences. Each node of a probabilistic suffix tree is associated with a probability vector that stores the probability distribution for the next symbol given the label of the node as the preceding segment, and algorithms that use probabilistic suffix trees to process regulatory motifs can be found in [82,69]. However, the probabilistic suffix tree is inefficient for efficiently handling weighted sequences, which is why the weighted suffix tree was introduced; however, it could be possible for a suitable combination of the two structures to be effective to handle both problem categories.…”

Section: Index Structures For Weighted Strings

mentioning

confidence: 99%


String Data Structures for Computational Molecular Biology

Makris

Theodoridis

2010

Algorithms in Computational Molecular Biology
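The citation statement above describes a probabilistic suffix tree as a structure in which each node stores a probability vector for the next symbol, given the node label as the preceding segment. A minimal sketch of that idea follows. It is not the construction from the cited works: the tree is flattened into a dictionary keyed by context string, probabilities are plain relative frequencies without smoothing or pruning, and the names `build_pst`, `next_symbol_distribution`, and `max_depth` are hypothetical.

```python
from collections import Counter, defaultdict


def build_pst(sequences, max_depth=2):
    """Count next symbols for every context (node label) of length <= max_depth.

    Assumption: the tree is stored as a flat dict from context string to a
    Counter of next symbols, which is enough to illustrate the node contents.
    """
    counts = defaultdict(Counter)
    for seq in sequences:
        for i, symbol in enumerate(seq):
            for depth in range(max_depth + 1):
                if i - depth < 0:
                    break
                # seq[i - depth:i] is the segment immediately preceding symbol
                counts[seq[i - depth:i]][symbol] += 1
    return counts


def next_symbol_distribution(counts, history, max_depth=2):
    """Return the probability vector of the longest known suffix of history."""
    for depth in range(min(max_depth, len(history)), -1, -1):
        context = history[len(history) - depth:]
        if context in counts:
            total = sum(counts[context].values())
            return {s: c / total for s, c in counts[context].items()}
    return {}


# Hypothetical toy input, not data from the cited works.
model = build_pst(["ACGT", "ACGA"])
print(next_symbol_distribution(model, "ACG"))
```

Looking up the longest matching suffix of the history mirrors how a walk down the tree selects the deepest node whose label matches the preceding segment.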


“…Recently, several methods have been suggested to identify occurrences of known CRMs (Berman et al, 2002;Frith et al, 2001) and to find novel CRMs given a database of known motifs (Sharan et al, 2003;Kel-Margoulis et al, 2002;Aerts et al, 2003), but these methods are restricted to TFs whose binding sites have been previously characterized. To date, we are aware of only one approach that tries to identify novel CRMs and at the same time learn their component motifs de novo (Marsan and Sagot, 2000). A shortcoming of the latter approach is that it is based on a consensus sequence representation of a motif, which has less expressive power compared to the more widely used position weight matrix model.…”

Section: Identifying Spatial Cis-regulatory Modules

mentioning

confidence: 99%

A discriminative model for identifying spatial cis-regulatory modules

Segal

Sharan

2004

Proceedings of the Eighth Annual International Conference on Computational Molecular Biology - RECOMB '04


Transcriptional regulation is mediated by the coordinated binding of transcription factors to the upstream regions of genes. In higher eukaryotes, the binding sites of cooperating transcription factors are organized into short sequence units, called cis-regulatory modules. In this paper, we propose a method for identifying modules of transcription factor binding sites in a set of co-regulated genes, using only the raw sequence data as input. Our method is based on a novel probabilistic model that describes the mechanism of cis-regulation, including the binding sites of cooperating transcription factors, the organization of these binding sites into short sequence modules, and the regulation of a gene by its modules. We show that our method is successful in discovering planted modules in simulated data and known modules in yeast. More importantly, we applied our method to a large collection of human gene sets and found 83 significant cis-regulatory modules, which included 36 known motifs and many novel ones. Thus, our results provide one of the first comprehensive compendiums of putative cis-regulatory modules in human.


“…Some recent methods attempt to incorporate site-clustering information with de novo motif discovery by building a rule to discriminate modules preserving a certain ordering of motifs from sequences with random occurrences of motifs (20,21). However, these methods do not explicitly specify a probability model and impose restrictive conditions such as a known number of motifs in the module or a known number of occurrences of each motif in the module.…”

mentioning

confidence: 99%

De novo cis-regulatory module elicitation for eukaryotic genomes

Gupta

Liu

2005

Proc. Natl. Acad. Sci. U.S.A.


Transcription regulation is controlled by coordinated binding of one or more transcription factors in the promoter regions of genes. In many species, especially higher eukaryotes, transcription factor binding sites tend to occur as homotypic or heterotypic clusters, also known as cis-regulatory modules. The number of sites and distances between the sites, however, vary greatly in a module. We propose a statistical model to describe the underlying cluster structure as well as individual motif conservation and develop a Monte Carlo motif screening strategy for predicting novel regulatory modules in upstream sequences of coregulated genes. We demonstrate the power of the method with examples ranging from bacterial to insect and human genomes.

Keywords: evolutionary Monte Carlo | gene regulation | hidden Markov models | transcription factor binding sites

Transcription factor binding sites (TFBSs) are short sequence segments (≈10 bp) located near genes' transcription start sites (TSSs) and are recognized by respective transcription factors (TFs) for gene regulation. Laboratory assays such as electrophoretic mobility shift assays and DNase footprinting have been developed to locate TFBSs on a gene-by-gene and site-by-site basis, but these methods are laborious, time-consuming, and unsuitable for large-scale studies. Computational methods thus have become necessary for genome-wide analyses of transcription regulation.

TFBSs recognized by the same TF usually show a conserved pattern, which is often called a TF binding motif (TFBM) and modeled by a position-specific weight matrix (PSWM) with each of its columns describing the occurrence frequencies of the four nucleotides in the corresponding motif position. Over the past decade, a spate of computational methods have been developed to infer TFBMs for sets of coregulated genes (1-9). There also have been efforts to improve motif prediction by correlating sequence patterns with mRNA expression data (10, 11) or by using comparative genomics information (12-14). Although these methods have been very successful for bacterial and yeast genomes, they have met with limited success in mammalian genomes.

The main difficulties with in silico TFBM predictions in high eukaryotes include the increased volume of the sequence search space, with proximal TFBSs occurring a few kilobases away from the TSSs; the increased occurrence of low-complexity repeats; the increased complexity in combinatorial controls; and shorter and less-conserved TFBSs. Despite these challenges, there are two possible redeeming factors: (i) many eukaryotic genomes have been or are being sequenced, and comparative genomic analysis can be extremely powerful; and (ii) most eukaryotic genes are controlled by a combination of factors with the corresponding binding sites forming homotypic or heterotypic clusters known as "cis-regulatory modules" (CRMs) (15,16). A statistical model that can explicitly incorporate the CRM concept is likely to bring out more information.

Most available approaches for discovering CRMs have concen...
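The abstract above describes a position-specific weight matrix (PSWM) as a matrix whose columns hold the occurrence frequencies of the four nucleotides at each motif position. A minimal sketch of that representation follows, assuming a set of already aligned, equal-length binding sites. The site strings and the helper names `build_pswm` and `consensus` are hypothetical illustrations, not taken from the cited paper, and no pseudocounts or background model are applied.

```python
def build_pswm(sites, alphabet="ACGT"):
    """Return one frequency dict per motif position (one matrix column each).

    Assumption: the sites are pre-aligned and all have the same length.
    """
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append({base: column.count(base) / len(sites) for base in alphabet})
    return matrix


def consensus(matrix):
    """Collapse each column to its most frequent nucleotide."""
    return "".join(max(column, key=column.get) for column in matrix)


# Hypothetical aligned sites, used only to exercise the functions.
sites = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
pswm = build_pswm(sites)
print(pswm[2])          # frequencies at the third motif position
print(consensus(pswm))  # TATAAT
```

Collapsing the matrix to a consensus string discards the per-position frequencies, which illustrates why a consensus representation carries less information than the matrix it was derived from.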

