2005
DOI: 10.1093/bioinformatics/bti806

Application of compression-based distance measures to protein sequence classification: a methodological study

Abstract: We constructed compression-based distance measures (CBMs) using the Lempel-Ziv and the PPMZ compression algorithms and compared their performance with that of the Smith-Waterman algorithm and BLAST, using nearest neighbour or support vector machine classification schemes. The datasets included a subset of the SCOP protein structure database to test distant protein similarities, 3-phosphoglycerate kinase sequences selected from archaean, bacterial and eukaryotic species, as well as low and high-complexity sequ…
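
To make the idea in the abstract concrete, here is a minimal sketch of nearest-neighbour classification driven by a compression-based distance. It is not the authors' implementation: `zlib` (a DEFLATE/LZ77 compressor) stands in for the Lempel-Ziv compressor, the distance is assumed to be the normalized compression distance of Cilibrasi and Vitányi, and the sequences, the `mutate` helper, and the family labels are all synthetic and hypothetical.

```python
import random
import zlib

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def compressed_size(seq: str) -> int:
    """Length in bytes of the zlib-compressed sequence (assumed stand-in for LZ)."""
    return len(zlib.compress(seq.encode("ascii"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = compressed_size(x), compressed_size(y)
    return (compressed_size(x + y) - min(cx, cy)) / max(cx, cy)


def nearest_neighbour(query: str, labelled: list[tuple[str, str]]) -> str:
    """Return the label of the training sequence closest to the query."""
    return min(labelled, key=lambda item: ncd(query, item[0]))[1]


def mutate(seq: str, rate: float, rng: random.Random) -> str:
    """Hypothetical helper: replace each residue with a random one at the given rate."""
    return "".join(
        rng.choice(AMINO_ACIDS) if rng.random() < rate else residue
        for residue in seq
    )


# Synthetic toy data: two unrelated "families" built from random base sequences.
rng = random.Random(0)
base_a = "".join(rng.choice(AMINO_ACIDS) for _ in range(400))
base_b = "".join(rng.choice(AMINO_ACIDS) for _ in range(400))

training = [
    (mutate(base_a, 0.03, rng), "family_A"),
    (mutate(base_a, 0.03, rng), "family_A"),
    (mutate(base_b, 0.03, rng), "family_B"),
    (mutate(base_b, 0.03, rng), "family_B"),
]

query = mutate(base_a, 0.03, rng)
predicted = nearest_neighbour(query, training)
print(predicted)
```

The same distance values could instead be used as features for a support vector machine, the other classification scheme the abstract names; that variant is not sketched here.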

Cited by 64 publications (55 citation statements)
References 27 publications
“…The problem is very complicated and non-trivial. Proper selection of the protein domain is necessary [102][103][104][105][106][107][108]. In addition to pure chemical data [109][110][111][112][113][114][115][116] in the context of the Drug Discovery [117][118][119][120][121][122][123][124][125][126][127], there is also a need for some knowledge on protein-protein interactions, the high quality structural prediction of proteins [2][128][129][130][131][132][133][134][135][136] and their inhibitors, and a detailed understanding of how those inhibitors affect the molecular recognition between proteins.…”
Section: Results
mentioning
confidence: 99%
“…This theory has found many applications in pattern recognition, phylogeny, clustering and classification. For objects that are represented as computer files such applications range from weather forecasting, software, earthquake prediction, music, literature, ocr, bioinformatics, to internet [1], [2], [5], [8]-[10], [12], [19], [20], [22], [23], [25], [31]-[33], [40]. For objects that are only represented by name, or objects that are abstract like "red," "Einstein," "three," the normalized information distance uses background information provided by Google, or any search engine that produces aggregate page counts.…”
mentioning
confidence: 99%
“…Previous studies have indicated the importance of compression algorithms in comparing the similarity of sequence data. In this study we used bzip2 to calculate the alignment free sequence similarity of sequences using compression-based metrics (CBM) (Cilibrasi and Vitanyi, 2005; Kocsor et al, 2006):…”
Section: Descriptors To Represent a Protein Pair
mentioning
confidence: 99%
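
The formula itself is elided in the statement above, so the following is only a sketch of how a bzip2-based pairwise descriptor might be computed. It assumes the normalized compression distance of Cilibrasi and Vitányi and uses Python's standard `bz2` module; the function names and the random test sequences are hypothetical and not taken from the citing paper.

```python
import bz2
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def bz2_size(seq: str) -> int:
    """Length in bytes of the bzip2-compressed sequence."""
    return len(bz2.compress(seq.encode("ascii")))


def compression_distance(x: str, y: str) -> float:
    """Assumed form: (C(xy) - min(C(x), C(y))) / max(C(x), C(y))."""
    cx, cy = bz2_size(x), bz2_size(y)
    return (bz2_size(x + y) - min(cx, cy)) / max(cx, cy)


# Synthetic sequences for illustration only; real input would be protein sequences.
rng = random.Random(1)
seq_a = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))
seq_b = "".join(rng.choice(AMINO_ACIDS) for _ in range(2000))

d_self = compression_distance(seq_a, seq_a)    # a sequence paired with itself
d_other = compression_distance(seq_a, seq_b)   # two unrelated random sequences
print(f"self: {d_self:.3f}  unrelated: {d_other:.3f}")
```

With a real compressor the distance of a sequence to itself is not exactly zero, and short sequences are dominated by bzip2's fixed header overhead, so values are best read relative to one another.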
“…The field of automatic functional annotation is rapidly evolving. Widely accepted approaches include the use of sequence alignment tools, such as BLAST, PSI-BLAST (Altschul et al, 1997), FASTA (Pearson, 1996) or more sensitive methods, such as hidden Markov models (Sonnhammer et al, 1997) or alignment-independent tools (Kocsor et al, 2006). These methods are applied to detect a set of candidate genes, which were previously manually annotated.…”
Section: Introduction
mentioning
confidence: 99%