Natural family-free genomic distance

Rubert, Diego P.; Martinez, Fábio Viduani; Braga, Marília D. V.

doi:10.1186/s13015-021-00183-8

Cited by 7 publications

(21 citation statements)

References 28 publications


“…The DCJ-indel distance $d^{id}_{dcj}(A, B, O)$ is the minimum number of DCJ and indel operations required to transform A into B assuming the orthologs given by O and allowing only the genes belonging to the complement of O to be inserted or deleted. It can be computed using an approach relying on the cycles and paths of a graph that represents the structural relation between genomes A and B according to the ortholog-set O [3,12] (this graph is equivalent to a consistent decomposition of the family-free relational graph, described in the next subsection and represented in Figure 1 (bottom)). Together with the weights of edges and vertices of S(A, B), the DCJ-indel distance $d^{id}_{dcj}$ allows the computation of the weighted rearrangement distance $wd^{id}_{dcj}$ [12]:…”

Section: Computing An Optimal Set Of Orthologs Between Two Genomes (mentioning)

confidence: 99%

“…Denote by OrthoFF(A, B, S) an optimal ortholog-set in S(A, B), which is an ortholog-set whose rearrangement distance equals GenDiFF(A, B, S). Computing the rearrangement distance GenDiFF(A, B, S) and finding an optimal ortholog-set OrthoFF(A, B, S) are NP-hard problems [12].…”

Section: Computing An Optimal Set Of Orthologs Between Two Genomes (mentioning)

confidence: 99%

“…The family-free relational graph FFR(A, B, S), shown in Figure 1 (bottom), represents all possible weighted distances corresponding to all candidate ortholog-sets in S(A, B) [12]. Given a gene m, denote the extremities of m by $m^h$ (head) and $m^t$ (tail).…”

Section: Family-free Relational Graph (mentioning)

confidence: 99%

“…The structure of D[L] has all necessary information for computing the value $wd^{id}_{dcj}(A, B, S, O_L)$, therefore we can say that $wd^{id}_{dcj}(A, B, S, O_L) = wd^{id}_{dcj}(D[L])$ [12]. Given that S is the set of all possible sibling-sets in FFR(A, B, S), we can modify our optimization problem to…”

Section: Consistent Decompositions Of the Family-free Relational Graph (mentioning)

confidence: 99%

“…This model is able to infer pairwise orthologs between two genomes directly, simultaneously based on gene similarities and rearrangements. In practice, its optimization function can be solved exactly due to an ILP formulation [12] that is called FF-DCJ-Indel and also reports an optimal matching of orthologs between the two analyzed genomes. (The ILP FF-DCJ-Indel is itself based on the previous formulations for family-based approaches [7,8].…”

Section: Introduction (mentioning)

confidence: 99%


Gene orthology inference via large-scale rearrangements for partially assembled genomes

Rubert, Braga. 2023. Preprint.

Background: Recently we developed a gene orthology inference tool based on genome rearrangements (Journal of Bioinformatics and Computational Biology 19:6, 2021). Given a set of genomes, our method first computes all pairwise gene similarities. Then it runs pairwise ILP comparisons to compute optimal gene matchings, which minimize, by taking the similarities into account, the weighted rearrangement distance between the analyzed genomes (a problem that is NP-hard). The gene matchings are then integrated into gene families in the final step. Although the ILP is quite efficient and could conceptually analyze genomes that are not completely assembled but split into several contigs, our tool failed to complete that task. The main reason is that each ILP pairwise comparison includes an optimal capping that connects each end of a linear segment of one genome to an end of a linear segment in the other genome, producing an exponential increase of the search space.

Results: In this work, we design and implement a heuristic capping algorithm that replaces the optimal capping by clustering the linear segments (based on their gene content intersections) into m ≥ 1 subsets, whose ends are capped independently. Furthermore, in each subset, instead of allowing all possible connections, we let only the ends of content-related segments be connected. Although there is no guarantee that m is much bigger than one, and with the possible side effect of resulting in sub-optimal instead of optimal gene matchings, the heuristic works very well in practice, in terms of both speed and the quality of computed solutions. Our experiments on fruit fly genomes show two positive results. First, for complete assemblies the version with heuristic capping reports orthologies that are very similar to the orthologies computed by the version of our tool with optimal capping. Second, we were able to efficiently analyze genomes with incomplete assemblies distributed in hundreds or even thousands of contigs, obtaining orthologies that are more similar to FlyBase orthologies when compared to orthologies computed by other inference tools. We added a post-processing step for refining, with the aid of the mcl algorithm, our ambiguous families (those with more than one gene per genome), further improving the accuracy of our results. Our approach is implemented in a pipeline incorporating the pre-computation of gene similarities and the post-processing refinement of ambiguous families with mcl. We optimized several aspects of our implementation, achieving running times that are similar to the fastest alternative tools. Both the original version with optimal capping and the new modified version with heuristic capping can be downloaded from our GitLab server at gitlab.ub.uni-bielefeld.de/gi/FFGC.
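The clustering step mentioned in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes, purely for illustration, that each linear segment (contig) is summarized by a set of hypothetical gene labels, and that two segments are content-related when their label sets intersect. The function name `cluster_segments` and the sample data are invented here. Segments are grouped transitively with a union-find structure, giving the m subsets whose ends would then be capped independently.

```python
from collections import defaultdict


def cluster_segments(segments):
    """Group segments whose label sets intersect, transitively.

    `segments` maps a segment name to a set of gene labels.
    The intersection criterion is an assumption of this sketch,
    not a detail taken from the cited abstract.
    """
    parent = {name: name for name in segments}

    def find(x):
        # Follow parent pointers to the root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Any two segments carrying the same label end up in one cluster.
    first_seen = {}
    for name, labels in segments.items():
        for label in labels:
            if label in first_seen:
                union(first_seen[label], name)
            else:
                first_seen[label] = name

    clusters = defaultdict(set)
    for name in segments:
        clusters[find(name)].add(name)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical contigs from two genomes, A and B.
segments = {
    "A_contig1": {"g1", "g2"},
    "A_contig2": {"g3"},
    "B_contig1": {"g2", "g4"},
    "B_contig2": {"g3", "g5"},
    "B_contig3": {"g6"},
}
clusters = cluster_segments(segments)
print(len(clusters), clusters)
```

With this toy input the sketch yields m = 3 subsets. In the worst case every segment is linked to every other and m = 1, which matches the abstract's caveat that m is not guaranteed to be much bigger than one.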


Family-Free Genome Comparison

Braga, Doerr, Rubert et al. 2024. Methods in Molecular Biology.

Generalizations of the genomic rank distance to indels

et al. 2023


Motivation: The rank distance model, introduced by Zanetti et al. (2016), represents genome rearrangements in multi-chromosomal genomes as matrix operations, which allows the reconstruction of parsimonious histories of evolution by rearrangements. We seek to generalize this model by allowing for genomes with different gene content, to accommodate a broader range of biological contexts. We approach this generalization by using a matrix representation of genomes. This leads to simple distance formulas and sorting algorithms for genomes with different gene contents, but without duplications.

Results: We generalize the rank distance to genomes with different gene content in two different ways. The first approach adds insertions, deletions, and the substitution of a single extremity to the basic operations. We show how to efficiently compute this distance. To avoid genomes with incomplete markers, our alternative distance, the rank-indel distance, only uses insertions and deletions of entire chromosomes. We construct phylogenetic trees with our distances and the DCJ-Indel distance for simulated data and real prokaryotic genomes, and compare them against reference trees. For simulated data, our distances outperform the DCJ-Indel distance using the Quartet metric as baseline. This suggests that rank distances are more robust for comparing distantly related species. For real prokaryotic genomes, all rearrangement-based distances yield phylogenetic trees that are topologically distant from the reference (65% similarity with Quartet metric), but are able to cluster related species within their respective clades and distinguish the Shigella strains as the farthest relative of the E. coli strains, a feature not seen in the reference tree.

Availability: Code and instructions available at https://github.com/meidanis-lab/rank-indel.

Supplementary information: Supplementary data are available at Bioinformatics online.
