An improved string composition method for sequence comparison

Lu, Guoqing; Zhang, Shunpu; Fang, Xiang

doi:10.1186/1471-2105-9-s6-s15

Cited by 38 publications

(47 citation statements)

References 22 publications

“…Lu et al (2008) have found two problems associated with composition vector methods: (a) there is a positive correlation between the observed count $c(w_{k,1} \ldots w_{k,k})$ and the estimated expected count $c^{0}(w_{k,1} \ldots w_{k,k})$, and (b) a square root needs to be applied to the denominator. Without such an operation, the normalized count tends to be over-standardized.…”

Section: Word Statistics (mentioning)

confidence: 99%
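
The two normalizations contrasted in the statement above can be sketched as follows. This is a minimal illustration, not the method of Lu et al (2008): it assumes the expected count is estimated with the usual composition-vector Markov formula (counts of the two overlapping (k-1)-words divided by the count of the shared (k-2)-word), and it reads point (b) as dividing by the square root of the expected count. The function names `count_words` and `normalized_counts` are hypothetical.

```python
from collections import Counter
from math import sqrt


def count_words(seq, k):
    """Count overlapping k-words in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def normalized_counts(seq, k):
    """Map each observed k-word to (relative_diff, sqrt_scaled_diff).

    Assumption: expected count c0 = c(prefix) * c(suffix) / c(middle),
    the Markov estimate used by composition vector methods.
    """
    if k < 3:
        raise ValueError("k must be at least 3 for this estimate")
    c_k = count_words(seq, k)
    c_k1 = count_words(seq, k - 1)
    c_k2 = count_words(seq, k - 2)
    result = {}
    for word, c in c_k.items():
        # Every sub-word of an observed k-word is itself observed,
        # so the denominator below is never zero.
        c0 = c_k1[word[:-1]] * c_k1[word[1:]] / c_k2[word[1:-1]]
        relative = (c - c0) / c0          # denominator without square root
        scaled = (c - c0) / sqrt(c0)      # denominator with square root
        result[word] = (relative, scaled)
    return result


if __name__ == "__main__":
    for word, (rel, scaled) in sorted(normalized_counts("AAAB", 3).items()):
        print(word, round(rel, 4), round(scaled, 4))
```

Both scores share the same numerator, so they differ only in how strongly a small or large expected count rescales the deviation.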

“…This enables building more complex, biologically realistic models with large numbers of parameters, such as Markov model (Pham and Zuegg 2004; Hao and Qi 2004; Wu et al 2006), mix model such as Markov model plus k-word distributions (Kantorovitz et al 2007), and Bernoulli model assuming a known word distribution (Lu et al 2008). Although the more complex models in biological sequence comparison are general improvements over the traditional word-based models (Blaisdell 1986; Wu et al 1997, 2001; Stuart et al 2002), some problems in developing statistical models and estimating the parameters of the complex models have impeded the development and adoption of these or other more complex models.…”

Section: Introduction (mentioning)

confidence: 98%

“…Recently, Lu et al (2008) proposed an improved composition vector (ICV) method that takes into consideration the above problems and achieves better performance in sequence comparison. The word normalization in improved composition vector is desirable, but not sufficient, because much effort of the word normalization aims to find better ways of utilizing evolution information.…”

Section: Introduction (mentioning)

confidence: 99%

“…, $w_{k,m}$), 0 otherwise. We then obtain where $V[f(w_k) \mid M]$ is the variance of the k-word frequencies under Markov model (M). Similar to the work of Lu et al (2008), we normalized the frequencies of the k-words under Markov model (M), denoted by NF, as follows:…”

mentioning

confidence: 99%

Using Markov model to improve word normalization algorithm for biological sequence comparison

et al. 2011

There are two crucial problems with statistical measures for sequence comparison: overlapping structures and background information of words in biological sequences. Word normalization in improved composition vector method took into account these problems and achieved better performance in evolutionary analysis. The word normalization is desirable, but not sufficient, because it assumes that the four bases A, C, T, and G occur randomly with equal chance. This paper proposed an improved word normalization which uses Markov model to estimate exact k-word distribution according to observed biological sequence and thus has the ability to adjust the background information of the k-word frequencies in biological sequences. The improved word normalization was tested with three experiments and compared with the existing word normalization. The experiment results confirm that the improved word normalization using Markov model to estimate the exact k-word distribution in biological sequences is more efficient.
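
To illustrate the contrast the abstract draws between an equal-chance background and one estimated from the observed sequence, here is a small sketch. It is not the algorithm of the cited paper: a first-order Markov chain fitted by simple counting is assumed for illustration, and the function names `uniform_word_prob` and `markov1_word_prob` are hypothetical.

```python
from collections import Counter


def uniform_word_prob(word, alphabet_size=4):
    """Probability of word when every base is equally likely."""
    return alphabet_size ** -len(word)


def markov1_word_prob(seq, word):
    """Probability of word under a first-order Markov chain fitted to seq.

    Assumption: the initial probability is the base frequency in seq and
    transition probabilities are maximum-likelihood estimates from
    adjacent-pair counts.
    """
    base = Counter(seq)
    pairs = Counter(zip(seq, seq[1:]))
    starts = Counter(seq[:-1])  # symbols that have a successor in seq
    prob = base[word[0]] / len(seq)
    for a, b in zip(word, word[1:]):
        if starts[a] == 0:
            return 0.0  # no transition out of a was ever observed
        prob *= pairs[(a, b)] / starts[a]
    return prob


if __name__ == "__main__":
    seq = "ACGTACGTAC"
    print(uniform_word_prob("ACG"))
    print(markov1_word_prob(seq, "ACG"))
```

On a strongly periodic sequence like the one in the usage lines, the fitted background assigns the recurring word a much higher probability than the equal-chance background does, which is the kind of adjustment the abstract describes.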

“…This trend will probably continue with new transformations that emerge from its integrative, quantitative and impressive nature (Fuchs, 2002). This growing proliferation of data from biological sequences made possible the development of many algorithms for the analysis and mining of knowledge (Lu et al, 2008).…”

Section: Introduction (mentioning)

confidence: 99%

Identification and isolation of full-length cDNA sequences by sequencing and analysis of expressed sequence tags from guarana (Paullinia cupana)

Figueirêdo, Faria-Campos, Astolfi-Filho et al. 2011

Genet. Mol. Res.

ABSTRACT. The current intense production of biological data, generated by sequencing techniques, has created an ever-growing volume of unanalyzed data. We reevaluated data produced by the guarana (Paullinia cupana) transcriptome sequencing project to identify cDNA clones with complete coding sequences (full-length clones) and complete sequences of genes of biotechnological interest, contributing to the knowledge of biological characteristics of this organism. We analyzed 15,490 ESTs of guarana in search of clones with complete coding regions. A total of 12,402 sequences were analyzed using BLAST, and 4697 full-length clones were identified, responsible for the production of 2297 different proteins. Eighty-four clones were identified as full-length for N-methyltransferase and 18 were sequenced in both directions to obtain the complete genome sequence, and confirm the search made in silico for full-length clones. Phylogenetic analyses were made with the complete genome sequences of three clones, which showed only 0.017% dissimilarity; these are phylogenetically close to the caffeine synthase of Theobroma cacao. The search for full-length clones allowed the identification of numerous clones that had the complete coding region, demonstrating this to be an efficient and useful tool in the process of biological data mining. The sequencing of the complete coding region of identified full-length clones corroborated the data from the in silico search, strengthening its efficiency and utility.

Novel Combinatorial and Information‐Theoretic Alignment‐Free Distances for Biological Data Mining

Giancarlo, Sciortino, Gabriele et al. 2010

Algorithms in Computational Molecular Biology
