Application of compression-based distance measures to protein sequence classification: a methodological study

Kocsor, András; Kertész‐Farkas, Attila; Kaján, László; Pongor, Sándor

doi:10.1093/bioinformatics/bti806

Cited by 64 publications

(55 citation statements)

References 27 publications

“…The problem is very complicated and non-trivial. Proper selection of the protein domain is necessary [102][103][104][105][106][107][108]. In addition to pure chemical data [109][110][111][112][113][114][115][116] in the context of the Drug Discovery [117][118][119][120][121][122][123][124][125][126][127], there is also a need for some knowledge on protein-protein interactions, the high quality structural prediction of proteins [2][128][129][130][131][132][133][134][135][136] and their inhibitors, and a detailed understanding of how those inhibitors affect the molecular recognition between proteins.…”

Section: Results

mentioning

confidence: 99%

The interactome: Predicting the protein-protein interactions in cells

Plewczyński

Ginalski

2009

Cellular and Molecular Biology Letters

Abstract: The term Interactome describes the set of all molecular interactions in cells, especially in the context of protein-protein interactions. These interactions are crucial for most cellular processes, so the full representation of the interaction repertoire is needed to understand the cell molecular machinery at the system biology level. In this short review, we compare various methods for predicting protein-protein interactions using sequence and structure information. The ultimate goal of those approaches is to present the complete methodology for the automatic selection of interaction partners using their amino acid sequences and/or three dimensional structures, if known. Apart from a description of each method, details of the software or web interface needed for high throughput prediction on the whole genome scale are also provided. The proposed validation of the theoretical methods using experimental data would be a better assessment of their accuracy.

“…This theory has found many applications in pattern recognition, phylogeny, clustering and classification. For objects that are represented as computer files such applications range from weather forecasting, software, earthquake prediction, music, literature, ocr, bioinformatics, to internet [1], [2], [5], [8]-[10], [12], [19], [20], [22], [23], [25], [31]-[33], [40]. For objects that are only represented by name, or objects that are abstract like "red," "Einstein," "three," the normalized information distance uses background information provided by Google, or any search engine that produces aggregate page counts.…”

mentioning

confidence: 99%

Information Distance in Multiples

Vitányi

2011

IEEE Trans. Inform. Theory

Abstract: Information distance is a parameter-free similarity measure based on compression, used in pattern recognition, data mining, phylogeny, clustering and classification. The notion of information distance is extended from pairs to multiples (finite lists). We study maximal overlap, metricity, universality, minimal overlap, additivity and normalized information distance in multiples. We use the theoretical notion of Kolmogorov complexity which for practical purposes is approximated by the length of the compressed version of the file involved, using a real-world compression program.
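
For orientation, the pairwise quantities that this abstract generalizes are usually written as below. These are the standard definitions from the information-distance literature, not text quoted from the abstract. The symbols are assumptions of this note: K denotes Kolmogorov complexity, C(·) the compressed length produced by whichever real-world compressor is chosen, and xy the concatenation of x and y.

```latex
\[
\mathrm{NID}(x,y) = \frac{\max\{K(x \mid y),\, K(y \mid x)\}}{\max\{K(x),\, K(y)\}},
\qquad
\mathrm{NCD}(x,y) = \frac{C(xy) - \min\{C(x),\, C(y)\}}{\max\{C(x),\, C(y)\}}
\]
```

The second expression is the computable stand-in for the first: K is not computable, so C replaces it in practice.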

“…Previous studies have indicated the importance of compression algorithms in comparing the similarity of sequence data. In this study we used bzip2 to calculate the alignment free sequence similarity of sequences using compression-based metrics (CBM) (Cilibrasi and Vitanyi, 2005; Kocsor et al, 2006):…”

Section: Descriptors To Represent a Protein Pair

mentioning

confidence: 99%
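
The quoted statement breaks off before giving its formula, so the exact metric used by the citing authors is not recoverable from this page. As a rough illustration only, a bzip2-based normalized compression distance in the style of Cilibrasi and Vitányi might be sketched as follows. The function names `compressed_size` and `ncd` are hypothetical, and the choice of NCD as the metric is an assumption rather than something the quote confirms.

```python
import bz2


def compressed_size(data: bytes) -> int:
    """Length in bytes of the bzip2-compressed input (highest compression level)."""
    return len(bz2.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Assumption: NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C is the bzip2-compressed length and xy is plain concatenation.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Calling `ncd(seq_a.encode(), seq_b.encode())` on two amino acid strings gives a smaller value when the sequences share more compressible structure. For short sequences the fixed overhead of the bzip2 container dominates, so the values are only meaningful in comparison with each other.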

“…The field of automatic functional annotation is rapidly evolving. Widely accepted approaches include the use of sequence alignment tools, such as BLAST, PSI-BLAST (Altschul et al, 1997), FASTA (Pearson, 1996) or more sensitive methods, such as hidden Markov models (Sonnhammer et al, 1997) or alignment-independent tools (Kocsor et al, 2006). These methods are applied to detect a set of candidate genes, which were previously manually annotated.…”

Section: Introduction

mentioning

confidence: 99%

Beyond the ‘best’ match: machine learning annotation of protein sequences by integration of different sources of information

et al. 2008

Motivation: Accurate automatic assignment of protein functions remains a challenge for genome annotation. We have developed and compared the automatic annotation of four bacterial genomes employing a 5-fold cross-validation procedure and several machine learning methods. Results: The analyzed genomes were manually annotated with FunCat categories in MIPS providing a gold standard. Features describing a pair of sequences rather than each sequence alone were used. The descriptors were derived from sequence alignment scores, InterPro domains, synteny information, sequence length and calculated protein properties. Following training we scored all pairs from the validation sets, selected a pair with the highest predicted score and annotated the target protein with functional categories of the prototype protein. The data integration using machine-learning methods provided significantly higher annotation accuracy compared to the use of individual descriptors alone. The neural network approach showed the best performance. The descriptors derived from the InterPro domains and sequence similarity provided the highest contribution to the method performance. The predicted annotation scores allow differentiation of reliable versus non-reliable annotations. The developed approach was applied to annotate the protein sequences from 180 complete bacterial genomes.
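
The annotation-transfer step described in the abstract (score every target/prototype pair, keep the highest-scoring pair, copy the prototype's functional categories to the target) can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' code: `transfer_annotation` is a hypothetical name, and `score_pair` is a placeholder for whatever trained model produces the pair score.

```python
from typing import Callable, Dict, Set, Tuple


def transfer_annotation(
    target: str,
    prototypes: Dict[str, Set[str]],
    score_pair: Callable[[str, str], float],
) -> Tuple[str, float, Set[str]]:
    """Annotate `target` with the categories of its best-scoring prototype.

    `prototypes` maps a prototype protein id to its functional categories.
    `score_pair(target, prototype_id)` is a hypothetical stand-in for the
    trained pair-scoring model; a higher score means a more reliable match.
    Returns (best prototype id, its score, copied category set).
    """
    if not prototypes:
        raise ValueError("at least one annotated prototype is required")
    # Score each pair once, then pick the prototype with the highest score.
    scores = {pid: score_pair(target, pid) for pid in prototypes}
    best_id = max(scores, key=scores.get)
    return best_id, scores[best_id], set(prototypes[best_id])
```

Returning the score alongside the categories mirrors the abstract's point that predicted scores can separate reliable from non-reliable annotations, for example by applying a threshold downstream.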
