Abstract: We present a method for automatically extracting groups of orthologous genes from a large set of genomes by a new clustering algorithm on a weighted multipartite graph. The method assigns a score to an arbitrary subset of genes from multiple genomes to assess the orthologous relationships between genes in the subset. This score is computed using sequence similarities between the member genes and the phylogenetic relationship between the corresponding genomes. An ortholog cluster is found as the subset with the…
“…This is because the algorithm must: (i) iterate over all edges e(u, v) in C (Step 3), with the worst-case complexity O(m) = O(g²); and for each, (ii) look for a vertex w and edge f(u, w) in G (Step 4), which is at worst O(g) if it must look through all other genomes in the g-partite graph; and finally for each of these, (iii) check whether u and w are adjacent in G, which is an efficient O(log g) lookup from the list of all adjacent vertices of w (or v). The worst-case complexity of EdgeSearch is comparable to the O(V³) (V = number of vertices) of another heuristic method described in Vashist et al. (2007), but uses different topological information, i.e. triangles in a SymBets graph rather than dense clusters (quasi-cliques) in a graph that may include all edges and does not require a species tree.…”
Section: Results (mentioning)
confidence: 97%
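The three nested steps in the statement above can be sketched in Python. This is a minimal illustration under stated assumptions, not the published C++ implementation: the function name `find_symbet_triangles` and the `(genome, gene)` tuple encoding are hypothetical, and hash sets (average O(1) membership) stand in for the O(log g) sorted-list lookup. The closing check here tests w against v, the other endpoint of the current edge.

```python
from collections import defaultdict


def find_symbet_triangles(edges):
    """Return the set of triangles in a SymBet graph.

    edges: iterable of (u, v) pairs, where each vertex is a
    hypothetical (genome, gene) tuple.
    """
    # Keep only intergenomic edges: a SymBet joins genes from different genomes.
    edges = [(u, v) for u, v in edges if u[0] != v[0]]

    # Adjacency sets for every vertex.
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    triangles = set()
    # (i) iterate over all edges e(u, v)
    for u, v in edges:
        # (ii) look for a vertex w joined to u by an edge f(u, w)
        for w in adj[u]:
            # (iii) adjacency check that closes the triangle u-v-w
            if w in adj[v]:
                triangles.add(frozenset((u, v, w)))
    return triangles


if __name__ == "__main__":
    demo = [
        (("A", "a1"), ("B", "b1")),
        (("B", "b1"), ("C", "c1")),
        (("A", "a1"), ("C", "c1")),
        (("C", "c1"), ("D", "d1")),  # dangling edge, closes no triangle
    ]
    for tri in find_symbet_triangles(demo):
        print(sorted(tri))
```

Storing each triangle as a `frozenset` means the same triangle reached from any of its three edges is recorded once.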
“…Examples of automated implementations of the former approach include the publicly available algorithms EnsemblCompara (Vilella et al., 2009), SYNERGY (Wapinski et al., 2007), RIO (Zmasek and Eddy, 2002), Orthostrapper (Storm and Sonnhammer, 2002) and the databases of orthologous protein families HOBACGEN, HOVERGEN and HOGENOME (Dufayard et al., 2005), whereas examples of the latter include OrthoMCL (Li et al., 2003), eggNOG (Jensen et al., 2008), InParanoid and MultiParanoid (Alexeyenko et al., 2006; O'Brien et al., 2005; Remm et al., 2001), MSOAR and MultiMSOAR (Fu and Jiang, 2007; Fu et al., 2007), Homologene (Sayers et al., 2010), RoundUp (Deluca et al., 2006) and OMA (Roth et al., 2008). Still other methods exist that do not fall neatly into either category, such as that described in (Vashist et al., 2007), which uses topological distance in a species tree as a factor in a linkage equation to find dense clusters in a multipartite graph (whose edges are not restricted to SymBets).…”
Motivation: Identifying orthologous genes in multiple genomes is a fundamental task in comparative genomics. Construction of intergenomic symmetrical best matches (SymBets) and joining them into clusters is a popular method of ortholog definition, embodied in several software programs. Despite their wide use, the computational complexity of these programs has not been thoroughly examined.
Results: In this work, we show that in the standard approach of iteration through all triangles of SymBets, the memory scales with at least the number of these triangles, O(g³) (where g = number of genomes), and construction time scales with the iteration through each pair, i.e. O(g⁶). We propose the EdgeSearch algorithm that iterates over edges in the SymBet graph rather than triangles of SymBets, and as a result has a worst-case complexity of only O(g³ log g). Several optimizations reduce the run-time even further in realistically sparse graphs. In two real-world datasets of genomes from bacteriophages (POGs) and Mollicutes (MOGs), an implementation of the EdgeSearch algorithm runs about an order of magnitude faster than the original algorithm and scales much better with increasing number of genomes, with only minor differences in the final results, and up to 60 times faster than the popular OrthoMCL program with a 90% overlap between the identified groups of orthologs.
Availability and implementation: C++ source code freely available for download at ftp.ncbi.nih.gov/pub/wolf/COGs/COGsoft/
Contact: dmk@stowers.org
Supplementary information: Supplementary materials are available at Bioinformatics online.
“…This combinatorial optimization problem has been studied in [20] and it has been shown that an efficient algorithm exists for finding the global optimal solution H* if the linkage function π(i, H) is monotone increasing. The monotone increasing property requires that the value of the linkage function for the vertex i can only increase when the second argument H increases in a set theoretic sense, i.e.…”
Section: Combinatorial Selection Of Characteristic Image Patches (mentioning)
confidence: 99%
“…The algorithm for solving this combinatorial optimization problem is given [20], and is described in the pseudocode form in Algorithm 3.1. This iterative algorithm begins by calculating F(V⁺) and finds the set M₁ containing the set of vertices from V⁺ which have the minimum value of the linkage function, i.e.…”
Section: Combinatorial Selection Of Characteristic Image Patches (mentioning)
confidence: 99%
“…A complexity analysis of the method can be found in [20]. It runs in O(|E| + |V| log |V|) time, where E and V are the set of edges and vertices, respectively, in the graph.…”
Section: Combinatorial Selection Of Characteristic Image Patches (mentioning)
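The iterative procedure these statements refer to can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the algorithm as published in [20]: it assumes the subset score is F(H) = min over i in H of π(i, H), it recomputes every linkage value on each pass (so it does not achieve the quoted O(|E| + |V| log |V|) bound), and the names `max_min_linkage_subset` and `make_sum_linkage`, along with the weighted-sum linkage, are hypothetical.

```python
def make_sum_linkage(weights):
    """Build a hypothetical linkage: pi(i, H) = sum of weights from i to H.

    weights: dict mapping frozenset({a, b}) -> non-negative weight.
    With non-negative weights this linkage is monotone increasing in H.
    """
    def linkage(i, H):
        return sum(weights.get(frozenset((i, j)), 0.0) for j in H if j != i)
    return linkage


def max_min_linkage_subset(vertices, linkage):
    """Find a subset H maximizing F(H) = min_{i in H} linkage(i, H).

    Assumes linkage(i, H) is monotone increasing in H. Starting from the
    full vertex set, repeatedly drop the vertices with minimum linkage and
    remember the best-scoring set seen along the way.
    """
    H = set(vertices)
    best_set, best_score = set(), float("-inf")
    while H:
        scores = {i: linkage(i, H) for i in H}
        f = min(scores.values())           # F(H) for the current set
        if f > best_score:
            best_score, best_set = f, set(H)
        # Remove the set of minimizers (M_1, M_2, ... on successive passes).
        H -= {i for i, s in scores.items() if s == f}
    return best_set, best_score


if __name__ == "__main__":
    w = {
        frozenset(("a", "b")): 5.0,
        frozenset(("a", "c")): 5.0,
        frozenset(("b", "c")): 5.0,
        frozenset(("c", "d")): 1.0,  # weakly attached vertex
    }
    print(max_min_linkage_subset("abcd", make_sum_linkage(w)))
```

Monotonicity is what makes the greedy removal safe: the first time a member of an optimal set is removed, its linkage in the current (larger) set is at least its linkage in the optimal set, so the current set already scores at least as well.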
In object recognition tasks, where images are represented as constellations of image patches, often many patches correspond to the cluttered background. In this paper, we present a two-stage method for selecting the image patches which characterize the target object class and are capable of discriminating between the positive images containing the target objects and the complementary negative images. The first stage uses a combinatorial optimization formulation on a weighted multipartite graph. The following stage is a statistical method for selecting discriminative patches from the positive images. Another contribution of this paper is the part-based probabilistic method for object recognition, which uses a common reference frame instead of reference patch to avoid possible occlusion problems. We also explore different feature representation using principal component analysis (PCA) and 2D PCA. The experiment demonstrates our approach has outperformed most of the other known methods on a popular benchmark dataset while approaching the best known results.