A basic task in bioinformatics is the counting of k-mers in genome strings. The k-mer counting problem is to build a histogram of all substrings of length k in a given genome sequence. We present the open source k-mer counting software Gerbil, which has been designed for the efficient counting of k-mers for k ≥ 32. Given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. While existing k-mer counting tools suffer from excessive memory consumption or degrading performance for large k, Gerbil is able to efficiently support large k without much loss of performance. Our software implements a two-disk approach. In the first step, DNA reads are loaded from disk and distributed to temporary files that are stored on a working disk. In the second step, the temporary files are read again, split into k-mers and counted via a hash table approach. In addition, Gerbil can optionally use GPUs to accelerate the counting step. For large k, we outperform state-of-the-art open source k-mer counting tools for large genome data sets.
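The two steps described above can be illustrated with a minimal Python sketch. It is not Gerbil's implementation: the function names `partition_kmers` and `count_kmers` are hypothetical, in-memory lists stand in for the temporary files, and assigning a k-mer to a bucket by a CRC32 hash is an assumption made only to keep the example short.

```python
from collections import Counter, defaultdict
import zlib


def partition_kmers(reads, k, num_buckets):
    """Step 1 (sketch): distribute the k-mers of every read to buckets.

    The buckets are in-memory lists standing in for temporary files.
    """
    buckets = defaultdict(list)
    for read in reads:
        # A read of length n contains n - k + 1 k-mers (none if n < k).
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Assumption: bucket choice by CRC32. It is deterministic,
            # so equal k-mers always land in the same bucket.
            buckets[zlib.crc32(kmer.encode()) % num_buckets].append(kmer)
    return buckets


def count_kmers(reads, k, num_buckets=4):
    """Step 2 (sketch): count each bucket independently with a hash table."""
    histogram = Counter()
    for kmers in partition_kmers(reads, k, num_buckets).values():
        histogram.update(Counter(kmers))
    return histogram


if __name__ == "__main__":
    for kmer, count in sorted(count_kmers(["ACGTACGT"], 4).items()):
        print(kmer, count)
```

Because equal k-mers always share a bucket, each bucket can be counted on its own and only one bucket's hash table needs to be held in memory at a time, which is the motivation for splitting the work into two steps.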
Background: A basic task in bioinformatics is the counting of k-mers in genome sequences. Existing k-mer counting tools are most often optimized for small k < 32 and suffer from excessive memory resource consumption or degrading performance for large k. However, given the technology trend towards long reads of next-generation sequencers, support for large k becomes increasingly important. Results: We present the open source k-mer counting software Gerbil that has been designed for the efficient counting of k-mers for k ≥ 32. Our software is the result of an intensive process of algorithm engineering. It implements a two-step approach. In the first step, genome reads are loaded from disk and redistributed to temporary files. In a second step, the k-mers of each temporary file are counted via a hash table approach. In addition to its basic functionality, Gerbil can optionally use GPUs to accelerate the counting step. In a set of experiments with real-world genome data sets, we show that Gerbil is able to efficiently support both small and large k. Conclusions: While Gerbil’s performance is comparable to existing state-of-the-art open source k-mer counting tools for small k < 32, it vastly outperforms its competitors for large k, thereby enabling new applications which require large values of k. Electronic supplementary material: The online version of this article (doi:10.1186/s13015-017-0097-9) contains supplementary material, which is available to authorized users.
We consider the problem of constructing a bipartite graph whose degrees lie in prescribed intervals. Necessary and sufficient conditions for the existence of such graphs are well known. However, existing realization algorithms suffer from long running times. In this paper, we present a realization algorithm that constructs an appropriate bipartite graph G = (U, V, E) in O(|U| + |V| + |E|) time, which is asymptotically optimal. In addition, we show that our algorithm produces edge-minimal bipartite graphs and that it can easily be modified to construct edge-maximal graphs.
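To make the problem statement concrete, the following Python sketch checks whether a given bipartite graph satisfies prescribed degree intervals. It is only a verifier for a candidate solution, not the realization algorithm of the paper. The function name `degrees_within_intervals`, the dictionary-based input format, and the treatment of the graph as simple (no parallel edges) are assumptions made for illustration.

```python
def degrees_within_intervals(edges, u_bounds, v_bounds):
    """Return True if every vertex degree lies in its [low, high] interval.

    edges:    iterable of (u, v) pairs with u in U and v in V
    u_bounds: dict mapping each vertex of U to a (low, high) tuple
    v_bounds: dict mapping each vertex of V to a (low, high) tuple
    """
    deg_u = {u: 0 for u in u_bounds}
    deg_v = {v: 0 for v in v_bounds}
    # Assumption: the graph is simple, so duplicate pairs count once.
    for u, v in set(edges):
        deg_u[u] += 1
        deg_v[v] += 1
    ok_u = all(low <= deg_u[u] <= high for u, (low, high) in u_bounds.items())
    ok_v = all(low <= deg_v[v] <= high for v, (low, high) in v_bounds.items())
    return ok_u and ok_v


if __name__ == "__main__":
    u_bounds = {"u1": (1, 2), "u2": (1, 1)}
    v_bounds = {"v1": (1, 2), "v2": (0, 1)}
    edges = [("u1", "v1"), ("u1", "v2"), ("u2", "v1")]
    print(degrees_within_intervals(edges, u_bounds, v_bounds))
```

The check touches every vertex and every edge once, so with hash-based dictionaries its expected running time is linear in |U| + |V| + |E|, the same order as the bound stated for the realization algorithm.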
We introduce the decision support tool PANDA (Passenger Aware Novel Dispatching Assistance). Our web-based tool is designed to provide train dispatchers with detailed real-time information about the current passenger flow and the multidimensional impact of waiting decisions in case of train delays. After presenting the algorithmic background and PANDA's main features, we show how it can be utilized in a typical use case scenario for train dispatchers. Besides its practical value for train dispatchers, the framework can be used to systematically study scientific questions. As an example, we use our software to experimentally analyse the influence of waiting decisions on realistic passenger flows of Deutsche Bahn. In a first experiment, we evaluate PANDA's potential benefit for passengers. Our findings indicate that a remarkable reduction in total delay might be possible in comparison to current practice. In two additional experiments, we investigate the timing aspect of waiting decisions. Our observations suggest that the timing of waiting decisions is of crucial importance and that a carefully implemented early rerouting strategy has significant potential to reduce the resulting passenger delays.