Mingfu Shao scite author profile

We introduce Scallop, an accurate reference-based transcript assembler that improves reconstruction of multi-exon and lowly expressed transcripts. Scallop preserves long-range phasing paths extracted from reads, while producing a parsimonious set of transcripts and minimizing coverage deviation. On 10 human RNA-seq samples, Scallop produces 34.5% and 36.3% more correct multi-exon transcripts than StringTie and TransComb, and respectively identifies 67.5% and 52.3% more lowly expressed transcripts. Scallop achieves higher sensitivity and precision than previous approaches over a wide range of coverage thresholds.

An Exact Algorithm to Compute the Double-Cut-and-Join Distance for Genomes with Duplicate Genes

Shao

Lin

Moret

2015

Journal of Computational Biology

Computing the edit distance between two genomes is a basic problem in the study of genome evolution. The double-cut-and-join (DCJ) model has formed the basis for most algorithmic research on rearrangements over the last few years. The edit distance under the DCJ model can be computed in linear time for genomes without duplicate genes, while the problem becomes NP-hard in the presence of duplicate genes. In this article, we propose an integer linear programming (ILP) formulation to compute the DCJ distance between two genomes with duplicate genes. We also provide an efficient preprocessing approach to simplify the ILP formulation while preserving optimality. Comparison on simulated genomes demonstrates that our method outperforms MSOAR in computing the edit distance, especially when the genomes contain long duplicated segments. We also apply our method to assign orthologous gene pairs among human, mouse, and rat genomes, where once again our method outperforms MSOAR.
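For intuition about the linear-time case mentioned above (genomes without duplicate genes), the following is a minimal Python sketch. It is not the ILP method of the article. It assumes, for simplicity, that both genomes consist only of circular chromosomes over the same set of signed genes, in which case the DCJ distance equals the number of genes minus the number of cycles formed by alternating adjacencies of the two genomes. The function names and the input encoding (lists of signed integers) are illustrative choices.

```python
def adjacencies(chromosomes):
    """Map each gene extremity to its neighbour in a genome of circular chromosomes.

    A gene g is a signed integer; its extremities are (|g|, "t") and (|g|, "h").
    Reading +g traverses tail then head; reading -g traverses head then tail.
    """
    partner = {}
    for chrom in chromosomes:
        for i, gene in enumerate(chrom):
            nxt = chrom[(i + 1) % len(chrom)]  # wrap around: chromosome is circular
            right = (abs(gene), "h" if gene > 0 else "t")
            left = (abs(nxt), "t" if nxt > 0 else "h")
            partner[right] = left
            partner[left] = right
    return partner


def dcj_distance_circular(genome_a, genome_b):
    """DCJ distance for circular genomes without duplicate genes: N - C."""
    pa = adjacencies(genome_a)
    pb = adjacencies(genome_b)
    if pa.keys() != pb.keys():
        raise ValueError("genomes must contain the same genes, each exactly once")
    n_genes = len(pa) // 2  # two extremities per gene
    seen = set()
    cycles = 0
    for start in pa:
        if start in seen:
            continue
        cycles += 1
        x = start
        # Alternate between an adjacency of A and an adjacency of B
        # until the walk returns to an extremity already visited.
        while x not in seen:
            seen.add(x)
            y = pa[x]
            seen.add(y)
            x = pb[y]
    return n_genes - cycles


if __name__ == "__main__":
    # A single inversion of gene 2 separates these two circular genomes.
    print(dcj_distance_circular([[1, 2, 3]], [[1, -2, 3]]))
```

Each extremity is visited once, so the running time is linear in the number of genes. With duplicate genes, the correspondence between gene copies is no longer fixed, which is the source of the hardness the abstract refers to.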

Theory and A Heuristic for the Minimum Path Flow Decomposition Problem

Shao

Kingsford

2019

IEEE/ACM Trans. Comput. Biol. and Bioinf.

Motivated by multiple genome assembly problems and other applications, we study the following minimum path flow decomposition problem: given a directed acyclic graph $G = (V, E)$ with source $s$ and sink $t$ and a flow $f$, compute a set of $s$-$t$ paths $P$ and assign weight $w(p)$ for $p \in P$ such that $f(e) = \sum_{p \in P : e \in p} w(p)$ for all $e \in E$, and $|P|$ is minimized. We develop some fundamental theory for this problem, upon which we design an efficient heuristic. Specifically, we prove that the gap between the optimal number of paths and a known upper bound is determined by the nontrivial equations within the flow values. This result gives rise to the framework of our heuristic: to iteratively reduce the gap through identifying such equations. We also define an operation on certain independent substructures of the graph, and prove that this operation does not affect the optimality but can transform the graph into one with a desired property that facilitates reducing the gap. We apply and test our algorithm on both simulated random instances and perfect splice graph instances, and also compare it with the existing state-of-the-art algorithm for flow decomposition. The results illustrate that our algorithm can achieve very high accuracy on these instances, and also that our algorithm significantly improves on the previous algorithms. An implementation of our algorithm is freely available at https://github.com/Kingsford-Group/catfish.
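To make the problem statement concrete, here is a minimal Python sketch of a simple greedy baseline, often called greedy-width: repeatedly extract the source-to-sink path with the largest bottleneck flow and subtract it. This is not the heuristic described in the abstract and gives no guarantee of finding the minimum number of paths. It assumes the input is a valid flow (conservation holds at every internal node) given as a dictionary from edges to values; all function names are illustrative.

```python
from collections import defaultdict


def topological_order(edges):
    """Return the nodes of a DAG in topological order (Kahn's algorithm)."""
    succ = defaultdict(list)
    indeg = defaultdict(int)
    nodes = set()
    for u, v in edges:
        succ[u].append(v)
        indeg[v] += 1
        nodes.update((u, v))
    ready = [n for n in nodes if indeg[n] == 0]
    order = []
    while ready:
        u = ready.pop()
        order.append(u)
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                ready.append(v)
    return order


def widest_path(flow, order, s, t):
    """Find the s-t path that maximizes the minimum remaining flow on its edges."""
    out = defaultdict(list)
    for (u, v), f in flow.items():
        if f > 0:  # ignore edges whose flow is already used up
            out[u].append(v)
    best = {s: float("inf")}
    prev = {}
    for u in order:
        if u not in best:
            continue
        for v in out[u]:
            width = min(best[u], flow[(u, v)])
            if width > best.get(v, 0):
                best[v] = width
                prev[v] = u
    if t not in best:
        return None, 0
    path = [t]
    while path[-1] != s:
        path.append(prev[path[-1]])
    path.reverse()
    return path, best[t]


def greedy_width_decomposition(flow, s, t):
    """Decompose a flow on a DAG into weighted s-t paths, widest path first."""
    remaining = dict(flow)  # work on a copy so the caller's flow is untouched
    order = topological_order(remaining.keys())
    result = []
    while True:
        path, weight = widest_path(remaining, order, s, t)
        if path is None:
            break
        for u, v in zip(path, path[1:]):
            remaining[(u, v)] -= weight
        result.append((path, weight))
    return result


if __name__ == "__main__":
    example = {("s", "a"): 5, ("a", "b"): 3, ("a", "c"): 2,
               ("b", "t"): 3, ("c", "t"): 2}
    for path, weight in greedy_width_decomposition(example, "s", "t"):
        print(weight, "-".join(path))
```

Every extracted path drives at least one edge to zero, so the loop runs at most $|E|$ times. The number of paths it returns is only an upper bound on the optimum $|P|$.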

SQUID: Transcriptomic Structural Variation Detection from RNA-seq

Shao

Kingsford

2017

Preprint

Transcripts are frequently modified by structural variations, which lead to a fused transcript of either multiple genes (known as a fusion gene) or a gene and a previously non-transcribing sequence. Detecting these modifications (called transcriptomic structural variations, or TSVs), especially in cancer tumor sequencing, is an important and challenging computational problem. We introduce SQUID, a novel algorithm to accurately predict both fusion-gene and non-fusion-gene TSVs from RNA-seq alignments. SQUID unifies both concordant and discordant read alignments into one model, and doubles the accuracy on simulation data compared to other approaches. With SQUID, we identified novel non-fusion-gene TSVs on TCGA samples.

An Exact Algorithm to Compute the DCJ Distance for Genomes with Duplicate Genes

Shao

Lin

Moret

2014

Accurate assembly of transcripts through phase-preserving graph decomposition
