Tommi Hirvola scite author profile

Phil. Trans. R. Soc. A.

et al. 2014

Advances in DNA sequencing mean that databases of thousands of human genomes will soon be commonplace. In this paper, we introduce a simple technique for reducing the size of conventional indexes on such highly repetitive texts. Given upper bounds on pattern lengths and edit distances, we pre-process the text with the lossless data compression algorithm LZ77 to obtain a filtered text, for which we store a conventional index. Later, given a query, we find all matches in the filtered text, then use their positions and the structure of the LZ77 parse to find all matches in the original text. Our experiments show that this also significantly reduces query times.
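The filtering step can be illustrated with a minimal sketch. The abstract does not say how the filtered text is built, so the rule used here is an assumption: keep only the characters within `max_len` positions of an LZ77 phrase boundary and collapse every other run into a single separator. The function names, the naive quadratic parser, and the `#` separator (assumed not to occur in the text) are all hypothetical, and the mapping back to positions in the original text is omitted.

```python
def lz77_phrase_starts(text):
    """Naive greedy LZ77-style parse; returns the start index of each phrase.

    Each phrase is the longest prefix of the remaining text that also occurs
    starting at an earlier position, followed by one extra character.
    """
    starts, i, n = [], 0, len(text)
    while i < n:
        starts.append(i)
        length = 0
        # Extend while text[i:i+length+1] has an occurrence starting before i.
        while i + length < n and text.find(text[i:i + length + 1], 0, i + length) != -1:
            length += 1
        i += length + 1
    return starts


def filtered_text(text, max_len, sep="#"):
    """Keep characters near phrase boundaries; replace other runs by one sep.

    Assumption (not from the abstract): a window of max_len characters on
    each side of every phrase start is retained.
    """
    keep = [False] * len(text)
    for s in lz77_phrase_starts(text):
        for p in range(max(0, s - max_len), min(len(text), s + max_len)):
            keep[p] = True
    out = []
    for ch, kept in zip(text, keep):
        if kept:
            out.append(ch)
        elif not out or out[-1] != sep:
            out.append(sep)
    return "".join(out)


if __name__ == "__main__":
    repetitive = "ab" * 8
    print(lz77_phrase_starts(repetitive))  # [0, 1, 2]
    print(filtered_text(repetitive, 2))    # abab#
```

On a highly repetitive input the parse has few phrases, so most of the text falls outside every window and the filtered text is much shorter than the original. That is the property a conventional index built on the filtered text would benefit from.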

Approximate Online Matching of Circular Strings

Tarhio

2014

Bit-Parallel Approximate Matching of Circular Strings with k Mismatches

ACM J. Exp. Algorithmics

Tarhio

2017

We consider approximate string matching of a circular pattern consisting of the rotations of a pattern of length m. From SBNDM and Tuned Shift-Add, we derive a sublinear-time algorithm for searching a noncircular pattern with k allowed mismatches, which is extended to the problem of approximate circular pattern matching with k mismatches. We prove that the presented algorithms are average-optimal for m · ⌈log₂(k + 1) + 1⌉ = O(w), where w is the size of the computer word in bits. Experiments conducted under the aforementioned condition show that the new k-mismatches algorithm for circular strings outperforms previous solutions in practice. In particular, our algorithm is the first nonfiltering method for approximate circular string matching in sublinear average time, which makes it more suitable than earlier filtering methods for high error levels k/m and small alphabets.
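To make the problem statement concrete, here is a small sketch of k-mismatch matching with the classic Shift-Add idea, run once per rotation of the pattern. It is not the sublinear SBNDM-based algorithm of the paper: it takes O(nm) time, and for simplicity each mismatch counter gets enough bits to hold the value m, so no overflow handling is needed. The function names are hypothetical.

```python
def shift_add_k_mismatches(pattern, text, k):
    """Return end positions in text where pattern matches with <= k mismatches.

    The state packs m counters of b bits each; counter i holds the number of
    mismatches between pattern[0..i] and the last i+1 text characters.
    """
    m = len(pattern)
    b = m.bit_length()                      # wide enough to store values up to m
    ones = sum(1 << (i * b) for i in range(m))
    mask = (1 << (m * b)) - 1
    # table[c] has a 1 in counter i wherever pattern[i] differs from c.
    table = {
        c: sum(1 << (i * b) for i in range(m) if pattern[i] != c)
        for c in set(pattern)
    }
    state, ends = 0, []
    for j, c in enumerate(text):
        # Shift every counter up one slot, then add this character's mismatches.
        state = ((state << b) + table.get(c, ones)) & mask
        if j >= m - 1 and (state >> ((m - 1) * b)) <= k:
            ends.append(j)
    return ends


def circular_k_mismatches(pattern, text, k):
    """Start positions where some rotation of pattern matches with <= k mismatches."""
    m = len(pattern)
    starts = set()
    for r in range(m):
        rotation = pattern[r:] + pattern[:r]
        for end in shift_add_k_mismatches(rotation, text, k):
            starts.add(end - m + 1)
    return sorted(starts)


if __name__ == "__main__":
    print(circular_k_mismatches("abc", "xxbcaxx", 0))  # [2]
    print(circular_k_mismatches("abc", "xxbcaxx", 1))  # [1, 2, 3]
```

Python integers are unbounded, so the packed state never has to fit in a machine word here. In a word-based implementation, the condition m · ⌈log₂(k + 1) + 1⌉ = O(w) from the abstract is what bounds the packed state by the word size.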

A Graph-Theoretical Approach for Motif Discovery in Protein Sequences

Czeizler

IEEE/ACM Trans. Comput. Biol. and Bioinf.

Karhu

2017

Motif recognition is a challenging problem in bioinformatics due to the diversity of protein motifs. Many existing algorithms identify motifs of a given length, and are therefore either inapplicable or inefficient when searching simultaneously for motifs of various lengths. Searching for gapped motifs, although very important, is a highly time-consuming task due to the combinatorial explosion of possible combinations implied by the consideration of long gaps. We introduce a new graph-theoretical approach to identify motifs of various lengths, both with and without gaps. We compare our approach with two widely used methods, MEME and GLAM2, analyzing both the quality of the results and the required computational time. Our method provides results of slightly higher quality than MEME at a much faster rate, i.e., one eighth of MEME's query time. By using similarity indexing, we reduce the query times to an average of approximately one sixth of those required by GLAM2, while achieving slightly higher quality of results. More precisely, for sequence collections smaller than 50000 bytes GLAM2 is 13 times slower, while being at least as fast as our method on larger ones. The source code of our C++ implementation is freely available on GitHub: https://github.com/hirvolt1/debruijn-motif.

MIPS code compression

Hirvola

2012

Preprint

MIPS machine code is very structured: registers used before are likely to be used again, some instructions and registers are used more heavily than others, some instructions often follow each other, and so on. Standard file compression utilities, such as gzip and bzip2, do not take full advantage of this structure because they work on byte boundaries and don't see the underlying instruction fields. My idea is to filter opcodes, registers and immediates from MIPS binary code into distinct streams and compress them individually to achieve better compression ratios. Several different ways to split MIPS code into streams are considered. The results presented in this paper show that a simple filter can reduce final compressed size by up to 10% with gzip and bzip2.
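The stream-splitting idea can be sketched as follows. This is only one possible split, not necessarily any of those evaluated in the preprint: every 32-bit word is assumed to be big-endian and is cut into the 6-bit opcode, the two 5-bit register fields rs and rt, and the remaining low 16 bits, regardless of instruction type. The function names are hypothetical, and `zlib` (DEFLATE, the algorithm gzip uses) stands in for the gzip utility.

```python
import struct
import zlib


def split_streams(code: bytes):
    """Split big-endian 32-bit MIPS words into four field streams."""
    if len(code) % 4 != 0:
        raise ValueError("code length must be a multiple of 4 bytes")
    n = len(code) // 4
    words = struct.unpack(f">{n}I", code)
    opcodes = bytes(w >> 26 for w in words)           # bits 31..26
    rs = bytes((w >> 21) & 0x1F for w in words)       # bits 25..21
    rt = bytes((w >> 16) & 0x1F for w in words)       # bits 20..16
    low16 = struct.pack(f">{n}H", *(w & 0xFFFF for w in words))  # bits 15..0
    return opcodes, rs, rt, low16


def join_streams(opcodes, rs, rt, low16):
    """Inverse of split_streams: rebuild the original byte string."""
    n = len(opcodes)
    lows = struct.unpack(f">{n}H", low16)
    words = [
        (o << 26) | (s << 21) | (t << 16) | low
        for o, s, t, low in zip(opcodes, rs, rt, lows)
    ]
    return struct.pack(f">{n}I", *words)


def compressed_size_split(code: bytes) -> int:
    """Total size after compressing each stream on its own."""
    return sum(len(zlib.compress(s, 9)) for s in split_streams(code))


if __name__ == "__main__":
    # addiu $sp,$sp,-32 ; sw $ra,28($sp) ; lw $ra,28($sp) ; jr $ra
    sample = bytes.fromhex("27bdffe0" "afbf001c" "8fbf001c" "03e00008")
    streams = split_streams(sample)
    assert join_streams(*streams) == sample           # the filter is lossless
    print(len(zlib.compress(sample, 9)), compressed_size_split(sample))
```

On a four-instruction sample the per-stream headers dominate, so the comparison only becomes meaningful on real binaries. The sketch shows that the filter is a reversible rearrangement of the same bits.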