Baqiao Liu scite author profile

et al. 2021

Species tree inference from gene family trees is a significant problem in computational biology. However, gene tree heterogeneity, which can be caused by several factors including gene duplication and loss, makes the estimation of species trees very challenging. While several species tree estimation methods have been introduced in recent years to specifically address gene tree heterogeneity due to gene duplication and loss (such as DupTree, FastMulRFS, ASTRAL-Pro, and SpeciesRax), many incur high costs in terms of both running time and memory. We introduce a new approach, DISCO, that decomposes the multi-copy gene family trees into many single-copy trees, which allows methods previously designed for species tree inference in a single-copy gene tree context to be used. We prove that using DISCO with ASTRAL (i.e., ASTRAL-DISCO) is statistically consistent under the GDL model, provided that ASTRAL-Pro correctly roots and tags each gene family tree. We evaluate DISCO paired with different methods for estimating species trees from single-copy genes (e.g., ASTRAL, ASTRID, and IQ-TREE) under a wide range of model conditions, and establish that high accuracy can be obtained even when ASTRAL-Pro is not able to correctly root and tag the gene family trees. We also compare results using MI, an alternative decomposition strategy from Yang and Smith (2014), and find that DISCO provides better accuracy, most likely as a result of covering more of the gene family tree leafset in the output decomposition.
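
The abstract does not spell out the decomposition rule, so the following is only a minimal sketch of the general idea of cutting a rooted, tagged gene family tree into single-copy trees. It assumes (these are illustrative assumptions, not details taken from the abstract) that every internal node already carries a tag, "D" for duplication or "S" for speciation, and that at each duplication node the child with fewer leaves is split off as a separate tree while the larger child stays in place. The tuple encoding and the function name `decompose` are hypothetical.

```python
def decompose(tagged_tree):
    """Cut a rooted, tagged gene family tree at every duplication node.

    Input encoding (hypothetical): a leaf is a species label (str); an
    internal node is a tuple (tag, left, right) with tag "D" or "S".
    Output: a list of untagged trees (str leaves, 2-tuples for internal nodes).
    """
    pieces = []

    def size(tree):
        # Number of leaves in an untagged tree.
        return 1 if isinstance(tree, str) else size(tree[0]) + size(tree[1])

    def visit(node):
        if isinstance(node, str):
            return node
        tag, left, right = node
        left, right = visit(left), visit(right)
        if tag == "S":
            return (left, right)
        # Duplication node: keep the larger child in place (assumed rule)
        # and emit the other child as its own tree.
        keep, cut = (left, right) if size(left) >= size(right) else (right, left)
        pieces.append(cut)
        return keep

    pieces.append(visit(tagged_tree))
    return pieces


# A family tree with one duplication at the root: two copies of the gene,
# one sampled in species A, B, C and one sampled only in A and C.
family = ("D", ("S", ("S", "A", "B"), "C"), ("S", "A", "C"))
trees = decompose(family)
```

If the tags are consistent (every node whose two subtrees share a species is tagged "D"), each output tree contains each species at most once, so it can be passed to a method that expects single-copy gene trees.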

AOC: Assembling overlapping communities

Jakatdar

Warnow

et al. 2022

Through discovery of meso-scale structures, community detection methods contribute to the understanding of complex networks. Many community finding methods, however, rely on disjoint clustering techniques, in which node membership is restricted to one community or cluster. This strict requirement limits the ability to inclusively describe communities, since some nodes may reasonably be assigned to multiple communities. We have previously reported Iterative K-core Clustering (IKC), a scalable and modular pipeline that discovers disjoint research communities from the scientific literature. We now present Assembling Overlapping Clusters (AOC), a complementary meta-method that offers overlapping communities as an option, addressing this limitation of disjoint clustering. We present findings from the use of AOC on a network of over 13 million nodes that captures recent research in the very rapidly growing field of extracellular vesicles in biology. Peer Review: https://publons.com/publon/10.1162/qss_a_00227

An Improved Signal Processing Approach Based on Analysis Mode Decomposition and Empirical Mode Decomposition

et al. 2019

Empirical mode decomposition (EMD) is a widely used adaptive signal processing method, but it has shown some shortcomings in engineering practice, such as the sifting stop criterion for intrinsic mode functions (IMFs), mode mixing, and end effects. In this paper, an improved sifting stop criterion based on the valid data segment is proposed and compared with the traditional one. Results show that the new sifting stop criterion avoids the influence of end effects and improves the correctness of the EMD. In addition, a novel AEMD method combining analysis mode decomposition (AMD) and EMD is developed to solve the mode-mixing problem: EMD is first applied to process the original signal, and AMD is then used to decompose the resulting mixed modes. The decomposed modes are then reconstituted according to a certain principle, and the reconstituted components show that mode mixing is alleviated. The proposed method was compared with ensemble empirical mode decomposition (EEMD), the mainstream EMD-based improvement. Results indicated that both AEMD and EEMD can effectively restrain mode mixing, but AEMD has a shorter execution time than EEMD.

Weighted ASTRID: fast and accurate species trees from weighted internode distances

Warnow

2023

Algorithms Mol Biol

Background: Species tree estimation is a basic step in many biological research projects, but is complicated by the fact that gene trees can differ from the species tree due to processes such as incomplete lineage sorting (ILS), gene duplication and loss (GDL), and horizontal gene transfer (HGT), which can cause different regions within the genome to have different evolutionary histories (i.e., “gene tree heterogeneity”). One approach to estimating species trees in the presence of gene tree heterogeneity resulting from ILS operates by computing trees on each genomic region (i.e., computing “gene trees”) and then using these gene trees to define a matrix of average internode distances, where the internode distance in a tree T between two species x and y is the number of nodes in T between the leaves corresponding to x and y. Given such a matrix, a tree can then be computed using methods such as neighbor joining. Methods such as ASTRID and NJst (which use this basic approach) are provably statistically consistent, very fast (low degree polynomial time), and have had high accuracy under many conditions, which makes them competitive with other popular species tree estimation methods. In this study, inspired by the very recent work on weighted ASTRAL, we present weighted ASTRID, a variant of ASTRID that takes the branch uncertainty on the gene trees into account in the internode distance. Results: Our experimental study evaluating weighted ASTRID typically shows improvements in accuracy compared to the original (unweighted) ASTRID, and shows competitive accuracy against weighted ASTRAL, the state of the art. Our re-implementation of ASTRID also improves the runtime, with marked improvements on large datasets. Conclusions: Weighted ASTRID is a new and very fast method for species tree estimation that typically improves upon ASTRID and has comparable accuracy to weighted ASTRAL, while remaining much faster. Weighted ASTRID is available at https://github.com/RuneBlaze/internode.
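
The unweighted internode distance defined in the abstract can be illustrated with a short sketch. This is not the ASTRID or weighted ASTRID implementation (see the repository above for that). It only counts the nodes lying between two leaves and averages those counts over the gene trees that contain both species. The nested-tuple tree encoding and all function names are assumptions made for illustration; the top-level tuple is treated as an ordinary node of the tree, and each species is assumed to appear at most once per tree.

```python
from collections import defaultdict
from itertools import combinations


def leaf_ancestors(tree):
    """Map each leaf label to the ids of the nodes above it, top node first."""
    paths = {}
    next_id = 0
    stack = [(tree, ())]
    while stack:
        node, ancestors = stack.pop()
        if isinstance(node, str):
            paths[node] = ancestors
        else:
            here = ancestors + (next_id,)
            next_id += 1
            for child in node:
                stack.append((child, here))
    return paths


def internode_distances(tree):
    """Number of nodes strictly between each pair of leaves in one tree."""
    paths = leaf_ancestors(tree)
    dist = {}
    for x, y in combinations(sorted(paths), 2):
        a, b = paths[x], paths[y]
        shared = 0
        while shared < min(len(a), len(b)) and a[shared] == b[shared]:
            shared += 1
        # Nodes below the last common ancestor on each side, plus that ancestor.
        dist[(x, y)] = (len(a) - shared) + (len(b) - shared) + 1
    return dist


def average_internode_matrix(trees):
    """Average each pairwise distance over the trees containing both leaves."""
    total = defaultdict(int)
    count = defaultdict(int)
    for tree in trees:
        for pair, d in internode_distances(tree).items():
            total[pair] += d
            count[pair] += 1
    return {pair: total[pair] / count[pair] for pair in total}


# Two four-species gene trees with conflicting topologies.
gene_trees = [(("A", "B"), "C", "D"), (("A", "C"), "B", "D")]
matrix = average_internode_matrix(gene_trees)
```

The resulting dictionary plays the role of the distance matrix; a distance-based method such as neighbor joining would then be run on it to obtain the species tree. Weighting by branch uncertainty, as weighted ASTRID does, is not modeled here.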

WITCH-NG: efficient and accurate alignment of datasets with sequence length heterogeneity

Warnow

2023

Multiple sequence alignment (MSA) is a basic part of many bioinformatics pipelines, including in phylogeny estimation, prediction of structure for both RNAs and proteins, and metagenomic sequence analysis. Yet many sequence datasets exhibit substantial sequence length heterogeneity, both because of large insertions and deletions (indels) in the evolutionary history of the sequences and the inclusion of unassembled reads or incompletely assembled sequences in the input. A few methods have been developed that can be highly accurate in aligning datasets with sequence length heterogeneity, with UPP (Nguyen et al., Genome Biology 2015) one of the first methods to achieve good accuracy, and WITCH (Shen et al., Bioinformatics 2021) a recent improvement on UPP for accuracy. In this paper, we show how we can speed up WITCH. Our improvement includes replacing a critical step in WITCH (currently performed using a heuristic search) by a polynomial time exact algorithm using Smith-Waterman. Our new method, WITCH-NG (i.e., “next generation WITCH”), achieves the same accuracy but is substantially faster. WITCH-NG is available at https://github.com/RuneBlaze/WITCH-NG.
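
For readers unfamiliar with Smith-Waterman, the sketch below shows the textbook dynamic program for the best local alignment score between two strings. The abstract does not say what objects WITCH-NG aligns in this step or which scoring scheme it uses, so the scoring parameters (`match`, `mismatch`, `gap`) and the function name are illustrative assumptions, and this is not the WITCH-NG code.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b with linear gap penalties.

    Runs in O(len(a) * len(b)) time and keeps only two rows of the table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

Because every cell is filled exactly once, the optimum is found exactly in polynomial time, which is the property the abstract contrasts with the heuristic search used in WITCH.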