Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show that this also significantly reduces query times.
We consider approximate string matching of a circular pattern consisting of the rotations of a pattern of length m . From SBNDM and Tuned Shift-Add, we derive a sublinear-time algorithm for searching a noncircular pattern with k allowed mismatches, which is extended to the problem of approximate circular pattern matching with k mismatches. We prove that the presented algorithms are average-optimal for m ⋅⌈log 2 ( k +1)+1 ⌉ = O ( w ), where w is the size of the computer word in bits. Experiments conducted under the aforementioned condition show that the new k -mismatches algorithm for circular strings outperforms previous solutions in practice. In particular, our algorithm is the first nonfiltering method for approximate circular string matching in sublinear average time, which makes it more suitable than earlier filtering methods for high error levels k / m and small alphabets.
Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, thus being either not applicable or not efficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods: MEME and GLAM2 analyzing both the quality of the results and the required computational time. Our method provides results of a slightly higher level of quality than MEME but at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we drop the query times down to an average of approximately one sixth of the ones required by GLAM2, while achieving a slightly higher level of quality of the results. More precisely, for sequence collections smaller than 50000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available in GitHub: https://github.com/hirvolt1/debruijn-motif.
MIPS machine code is very structured: registers used before are likely to be used again, some instructions and registers are used more heavily than others, some instructions often follow each other and so on. Standard file compression utilities, such as gzip and bzip2, does not take full advantage of the structure because they work on byte-boundaries and don't see the underlying instruction fields. My idea is to filter opcodes, registers and immediates from MIPS binary code into distinct streams and compress them individually to achieve better compression ratios. Several different ways to split MIPS code into streams are considered. The results presented in this paper shows that a simple filter can reduce final compressed size by up to 10 % with gzip and bzip2.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.