Milla Kortelainen scite author profile

Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning, to bioinformatics. In some applications we look for a decomposition that is optimal with respect to some property, such as the number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. In these cases, no optimization criteria guarantee the identification of the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition. In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs, leading to a practical algorithm for finding the complete set of safe paths. In addition, we evaluate our algorithm on RNA transcript data sets against a trivial safe algorithm (extended unitigs), the recently proposed safe paths for path covers (TCBB 2021) and the popular heuristic greedy-width . On the one hand, we found that besides maintaining perfect precision, our safe and complete algorithm reports a significantly higher coverage ( 50 more) compared with the other safe algorithms. On the other hand, the greedy-width algorithm although reporting a better coverage, it also reports a significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by 20 ) greedy-width on a unified metric (F-score) considering both coverage and precision when the evaluated data set has a significant number of complex graphs. Moreover, it also has a superior time ( 5 ) and space performance ( 2 2 2 ), resulting in a better and more practical approach for bioinformatic applications of flow decomposition.

show abstract

Safety and Completeness in Flow Decompositions for RNA Assembly

Khan¹,

Kortelainen²,

Cáceres³

et al. 2022

Preprint

View full text Add to dashboard Cite

Decomposing a network flow into weighted paths is a problem with numerous applications, ranging from networking, transportation planning to bioinformatics. In some applications we look for any decomposition that is optimal with respect to some property, such as number of paths used, robustness to edge deletion, or length of the longest path. However, in many bioinformatic applications, we seek a specific decomposition where the paths correspond to some underlying data that generated the flow. For realistic inputs, no optimization criteria can be guaranteed to uniquely identify the correct decomposition. Therefore, we propose to instead report the safe paths, which are subpaths of at least one path in every flow decomposition.Recently, Ma, Zheng, and Kingsford [WABI 2020] addressed the existence of multiple optimal solutions in a probabilistic framework, which is referred to as non-identifiability. In a follow-up work [RECOMB 2021], they gave a quadratic-time algorithm based on a global criterion for solving a problem called AND-Quant, which generalizes the problem of reporting whether a given path is safe.In this work, we give the first local characterization of safe paths for flow decompositions in directed acyclic graphs (DAGs), leading to a practical algorithm for finding the complete set of safe paths. We additionally evaluated our algorithms against the trivial safe algorithms (unitigs, extended unitigs) and the popularly used heuristic (greedy-width) for flow decomposition on RNA transcripts datasets. We find that despite maintaining perfect precision the safe and complete algorithm reports significantly higher coverage (≈50% more) as compared to trivial safe algorithms. The greedy-width algorithm though reporting a better coverage, reports significantly lower precision on complex graphs (for genes expressing a large number of transcripts). Overall, our safe and complete algorithm outperforms (by ≈20%) greedy-width on a unified metric (F-Score) considering both coverage and precision when the evaluated dataset has significant number of complex graphs. Moreover, it also has superior time (3−5×) and space efficiency (1.2−2.2×), resulting in a better and more practical approach for bioinformatics applications of flow decomposition.

show abstract

scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.

Contact Info

customersupport@researchsolutions.com

10624 S. Eastern Ave., Ste. A-614

Henderson, NV 89052, USA

This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.

Blog Terms and Conditions API Terms Privacy Policy Contact Cookie Preferences Do Not Sell or Share My Personal Information

Made with 💙 for researchers

Part of the Research Solutions Family.