1995
DOI: 10.1089/cmb.1995.2.25

Method for Calculation of Probability of Matching a Bounded Regular Expression in a Random Data String

Abstract: A method is presented for determining within strict bounds the probability of matching a regular expression with a match start point in a given section of a random data string. The method in general requires time and space exponential in the number of optional characters in the regular expression, but in practice was used to determine bounds for probabilities of matching all the ProSite patterns without difficulty.
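
To make the quantity in the abstract concrete, here is a minimal Python sketch for the simplest case only: a fixed-length pattern with no optional characters, matched against a string of independent, identically distributed symbols. It is not the method of the paper. The function names, the list-of-sets pattern representation, and the example alphabet are assumptions made for illustration. The per-position probability is exact under this model. The lower bound uses only start positions spaced one pattern length apart, whose match events depend on disjoint characters and are therefore independent, and the upper bound is the union bound.

```python
from math import ceil, prod


def position_match_probability(pattern, freqs):
    """Exact probability that the pattern matches at one fixed start position.

    pattern: list of sets, one per position, each holding the allowed symbols
             (a hypothetical representation, not a ProSite parser).
    freqs:   dict mapping each symbol to its probability in the random string.
    """
    return prod(sum(freqs[s] for s in allowed) for allowed in pattern)


def window_match_bounds(pattern, freqs, n_starts):
    """Lower and upper bounds on P(at least one match starts in n_starts positions)."""
    p = position_match_probability(pattern, freqs)
    m = len(pattern)
    # Starts 0, m, 2m, ... read disjoint characters, so their events are independent.
    independent_starts = ceil(n_starts / m)
    lower = 1.0 - (1.0 - p) ** independent_starts
    # Union (Boole) bound over all candidate start positions.
    upper = min(1.0, n_starts * p)
    return lower, upper


# Example: uniform DNA alphabet, pattern A-[CG]-x-T (x = any base).
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
example_pattern = [{"A"}, {"C", "G"}, {"A", "C", "G", "T"}, {"T"}]
p_single = position_match_probability(example_pattern, uniform)
lower, upper = window_match_bounds(example_pattern, uniform, n_starts=8)
```

Patterns with optional or variable-length elements need more than this, because one start position can then match in several ways; that is the case the abstract describes as exponential in the number of optional characters.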

citations
Cited by 11 publications
(3 citation statements)
references
References 4 publications
“…The simplest such model is to treat DNA as a sequence of independent bases (a Bernoulli stream), each occurring with a probability equivalent to their frequency in the human genome. There are approximate methods (44,45) for calculating the expected number of PQS given a certain base frequency, and we have solved this problem explicitly as well. This gives the expected density of PQS, ρ(PQS), as a function of p, the probability of any individual base being guanine: $\rho(\mathrm{PQS}) = 343p^{12} - 882p^{13} + 756p^{14} - 1098p^{15} + 2835p^{16} - 3357p^{17} + 2484p^{18}$ However, applying this solution to the entire human genome gives predicted frequencies for GC-patterns of 8300 and for AT-patterns of 304 000.…”
Section: Results (mentioning)
confidence: 99%
“…Moreover, the CXXXC-X_n-CXC motif is insufficient on its own to establish likely homology of 2ds-CSab peptides. In addition to similar patterns being present in other unrelated structural motifs, such as some forms of the ICK fold (Nadezhdin et al, 2017), randomized sequence generation based on cysteine frequencies observed across the CSab superfamily (P(Cys) = 0.16) (Sewell and Durbin, 1995; showed that this cysteine motif arises in 39% of random 50-amino-acid-long cysteine-rich sequences. However, although the amino acid sequences are too diverse for accurate phylogenetic analysis, evolutionary relatedness would mean they are not random.…”
Section: Single or Multiple Origins of the 2ds-CSab Motif? (mentioning)
confidence: 81%
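
The randomized-sequence check described in the statement above can be sketched as a small Monte Carlo estimate. This is an illustration under stated assumptions, not the citing authors' procedure: the quote does not give the allowed range of n in the motif or the criterion for "cysteine-rich", so the regular expression below is hypothetical and the estimate is not expected to reproduce the 39% figure. Residues are reduced to two symbols, `C` for cysteine and `X` for any other residue, with P(Cys) = 0.16 taken from the quote.

```python
import random
import re


def motif_frequency(regex, freqs, length, trials, seed=0):
    """Estimate the fraction of random sequences containing a regex match.

    Sequences are drawn with independent symbols whose probabilities come
    from freqs (a Bernoulli-stream style model).
    """
    rng = random.Random(seed)  # fixed seed so the estimate is repeatable
    symbols = list(freqs)
    weights = [freqs[s] for s in symbols]
    compiled = re.compile(regex)
    hits = 0
    for _ in range(trials):
        seq = "".join(rng.choices(symbols, weights=weights, k=length))
        if compiled.search(seq):
            hits += 1
    return hits / trials


# Two-symbol alphabet: cysteine versus any other residue.
residue_freqs = {"C": 0.16, "X": 0.84}
# Hypothetical reading of the motif: the gap is "one or more residues".
HYPOTHETICAL_MOTIF = r"C.{3}C.+C.C"
estimate = motif_frequency(HYPOTHETICAL_MOTIF, residue_freqs, length=50, trials=2000)
```

Changing the gap range or restricting the sample to sequences with a minimum cysteine count would shift the estimate, which is why those choices are flagged as assumptions here.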
“…A more obvious way of applying Occam's Razor principle for conservation problem is through the minimum description length principle, discussed later. Another way of defining the fitness measure for the conservation problem is based on the statistical significance of the patterns (e.g., (Waterman, et al, 1984; Staden, 1989a; Neuwald and Green, 1994; Sewell and Durbin, 1995)), defined as follows. Suppose $p_1, \ldots, p_n$ are patterns such that each $p_i$ matches a subset $S_i$ of $S^+$.…”
Section: Ranking Discovered Patterns and Functions (mentioning)
confidence: 99%