Separating Significant Matches from Spurious Matches in DNA Sequences

Devillers, Hugo; Schbath, Sophie

doi:10.1089/cmb.2011.0070

Cited by 5 publications

(4 citation statements)

References 36 publications

“…If one wants to estimate phylogenetic distances between genomic sequences based on spaced-word matches between them, one needs to distinguish between matches representing true homologies and random background matches ( Devillers and Schbath, 2012 ). One possible way of reducing the number of background spaced-word matches would be to use a sufficiently high weight w , i.e.…”

Section: Algorithm (mentioning)

confidence: 99%

Fast and accurate phylogeny reconstruction using filtered spaced-word matches

2017

Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.

Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced-word matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.

Availability and Implementation: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/

Supplementary information: Supplementary data are available at Bioinformatics online.
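The spaced-word matching and filtering steps described in this abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the FSWM implementation: the function names (`spaced_words`, `filtered_matches`, `mismatch_fraction`), the simple +1/-1 similarity score, and the `min_score` threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def spaced_words(seq, pattern):
    """Index a sequence by its spaced words.

    pattern is a string of '1' (match position) and '0' (don't-care position).
    Returns a dict mapping each spaced word to the list of its start offsets.
    """
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        index[word].append(start)
    return index


def filtered_matches(seq_a, seq_b, pattern, min_score=0):
    """Return (start_a, start_b, mismatches) for spaced-word matches that pass the filter.

    Assumption: similarity is scored +1 per identical and -1 per differing
    position over the whole window; matches scoring below min_score are discarded.
    """
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    index_b = spaced_words(seq_b, pattern)
    kept = []
    for word, starts_a in spaced_words(seq_a, pattern).items():
        for a in starts_a:
            for b in index_b.get(word, []):
                score = sum(
                    1 if seq_a[a + i] == seq_b[b + i] else -1
                    for i in range(len(pattern))
                )
                if score >= min_score:
                    mismatches = sum(seq_a[a + i] != seq_b[b + i] for i in dont_care)
                    kept.append((a, b, mismatches))
    return kept


def mismatch_fraction(matches, pattern):
    """Fraction of mismatches at don't-care positions over all kept matches."""
    n_dont_care = pattern.count("0")
    if not matches or n_dont_care == 0:
        return None
    return sum(m for _, _, m in matches) / (len(matches) * n_dont_care)


matches = filtered_matches("ACGT", "ACTT", "1101")
print(matches, mismatch_fraction(matches, "1101"))
```

A distance estimate would then be derived from the mismatch fraction, for example by applying a substitution-model correction, which is outside the scope of this sketch.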

“…The total is about $O(\lambda n m^{2})$. In this study, the fast K-Nearest Neighbor Graph (K-NNG) construction method [48, 49] is applied to the construction of the weighted sample graph, which reduces the time complexity from $O(\lambda n m^{2})$ to $O(\lambda n m^{1.14})$.…”

Section: Methods (mentioning)

confidence: 99%

“…Scoring functions represent the core of ranking methods and are used to assign a relevance index to each feature/gene. The scoring functions mainly include the Z-score [11] and Welch t-test [12] from the t-test family, the Bayesian t-test [13] from the Bayesian scoring family, and the Info gain [14] method from the theory-based scoring family. However, the filter-ranking methods ignore the correlations among gene subset, so the selected gene subset may contain redundant information.…”
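As an illustration of the filter-ranking idea in this statement, the following sketch scores each gene independently with a Welch t-statistic and ranks genes by absolute score. It is not taken from the cited paper: the names `welch_t` and `rank_genes`, the dictionary input layout, and the two-class 0/1 labels are assumptions for the example.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch t-statistic for two samples with possibly unequal variances.

    Assumes each sample has at least two values and that the samples
    are not both constant (otherwise the denominator is zero).
    """
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def rank_genes(expr, labels):
    """Rank genes by |Welch t| between class 0 and class 1.

    expr: dict mapping gene name to a list of expression values, one per sample.
    labels: list of 0/1 class labels, aligned with the samples.
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, lab in zip(values, labels) if lab == 0]
        class1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = abs(welch_t(class0, class1))
    # Each gene is scored on its own, so correlated (redundant) genes
    # can all rank highly; this is the limitation the statement points out.
    return sorted(scores, key=scores.get, reverse=True)


expr = {"g1": [1, 2, 3, 1, 2, 3], "g2": [1, 2, 3, 7, 8, 9]}
print(rank_genes(expr, [0, 0, 0, 1, 1, 1]))
```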

Section: Related Work (mentioning)

confidence: 99%

Feature Subset Selection for Cancer Classification Using Weight Local Modularity

Zhao

2016

Sci Rep

Microarray is recently becoming an important tool for profiling the global gene expression patterns of tissues. Gene selection is a popular technology for cancer classification that aims to identify a small number of informative genes from thousands of genes that may contribute to the occurrence of cancers to obtain a high predictive accuracy. This technique has been extensively studied in recent years. This study develops a novel feature selection (FS) method for gene subset selection by utilizing the Weight Local Modularity (WLM) in a complex network, called the WLMGS. In the proposed method, the discriminative power of gene subset is evaluated by using the weight local modularity of a weighted sample graph in the gene subset where the intra-class distance is small and the inter-class distance is large. A higher local modularity of the gene subset corresponds to a greater discriminative of the gene subset. With the use of forward search strategy, a more informative gene subset as a group can be selected for the classification process. Computational experiments show that the proposed algorithm can select a small subset of the predictive gene as a group while preserving classification accuracy.

“…Average lengths of the repeats are given in Gu et al ( 2000 ). Recently, heuristics have been proposed and implemented (Devillers and Schbath, 2012 ; Rizk et al, 2013 ; Chikhi and Medvedev, 2014 ).…”

Section: Introduction (mentioning)

confidence: 99%

Accurate Prediction of the Statistics of Repetitions in Random Sequences: A Case Study in Archaea Genomes

Régnier

Chassignet

2016

Front. Bioeng. Biotechnol.

Repetitive patterns in genomic sequences have a great biological significance and also algorithmic implications. Analytic combinatorics allow to derive formula for the expected length of repetitions in a random sequence. Asymptotic results, which generalize previous works on a binary alphabet, are easily computable. Simulations on random sequences show their accuracy. As an application, the sample case of Archaea genomes illustrates how biological sequences may differ from random sequences.
