DNA sequence compression using the Burrows-Wheeler Transform

Adjeroh, Donald A.; Zhang, Y.; Mukherjee, Amar; Powell, Matthew J.; Bell, Timothy C.

doi:10.1109/csb.2002.1039352

Cited by 43 publications

(34 citation statements)

References 30 publications


“…We can easily define parallel arrays to also point to the position of the longest factor to permit easy access to these factors. Direct applications of our introduced data structures may include pattern substitution, detecting duplication [6], LZ decomposition in text compression [41], studying periodicity in strings [32,39], biological sequence compression [3,21], and analysis of repetition structures in DNA sequences [22,2]. Specifically, our pLF data structure may be used to identify how to best substitute a pattern or even determine if duplication is "hidden" by reversal or with parameterization.…”

Section: Discussion

mentioning

confidence: 99%

Variations of the parameterized longest previous factor

Beal, Adjeroh. 2012. Journal of Discrete Algorithms.

The parameterized longest previous factor (pLPF) problem as defined for parameterized strings (p-strings) adds a level of parameterization to the longest previous factor (LPF) problem originally defined for traditional strings. In this work, we consider the construction of the pLPF data structure and identify the strong relationship between the pLPF linear time construction and several variations of the problem. Initially, we propose a taxonomy of classes for longest factor problems. Using this taxonomy, we show an interesting connection between the pLPF and popular data structures. It is shown that a subset of longest factor problems may be created with the pLPF construction. More specifically, the pLPF problem is used as a foundation to achieve the linear time construction of popular data structures such as the LCP, parameterized-LCP (pLCP), parameterized-border (p-border) array, and border array. We further generalize the permuted-LCP for p-strings and provide a linear time construction. A number of new variations of the pLPF problem are proposed and addressed in linear time for both p-strings and traditional strings, including the longest not-equal factor (LneF), longest reverse factor (LrF), and longest factor (LF). The framework of the pLPF construction is exploited to efficiently address a multitude of data structures with prospects in various applications. Finally, we implement our algorithms and perform various experiments to confirm theoretical results.
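As a reference point for the abstract above, the traditional (non-parameterized) LPF array can be sketched with a brute-force scan. This is only an illustration of the definition, assuming the usual convention that LPF[i] is the length of the longest factor starting at position i that also starts at some earlier position (overlaps allowed). It runs in quadratic time or worse and is not the linear-time construction the paper describes. The function name `lpf_naive` is hypothetical.

```python
def lpf_naive(s: str) -> list[int]:
    """Brute-force longest previous factor array for a traditional string.

    lpf[i] is the length of the longest substring starting at i that
    also starts at some position j < i (occurrences may overlap).
    Illustrative only: this is NOT a linear-time construction.
    """
    n = len(s)
    lpf = [0] * n
    for i in range(n):
        best = 0
        for j in range(i):
            # Extend the match between the suffixes starting at j and i.
            k = 0
            while i + k < n and s[j + k] == s[i + k]:
                k += 1
            best = max(best, k)
        lpf[i] = best
    return lpf


if __name__ == "__main__":
    print(lpf_naive("abab"))
    print(lpf_naive("aaaa"))
```

The parameterized variant (pLPF) additionally allows parameter symbols to be renamed consistently when matching, which this sketch does not attempt.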


“…Recently, some researches in lossless compression methods commonly aim to optimize existing compression method for specific data type [7][8][9][10][11][12][13][14][15][16][17][18] or to improve the existing compression method by transforming data to other form before compression process or by combining several compression method [19][20][21][22]. One of novel researches in compression method is Asymmetric Numerical System (ANS) [23][24].…”

Section: New Lossless Compression Methods Using CRLCM (Hendra Mesra)

mentioning

confidence: 99%

“…The selection of K is based on frequency distribution of the difference (d) of iteration number (i) and its predictions (p) in (7). As example, the frequency distribution of the difference value on Lena image is shown on Figure 7.…”

mentioning

confidence: 99%

New Lossless Compression Method using Cyclic Reversible Low Contrast Mapping (CRLCM)

Mesra, Tjandrasa, Fatichah. 2016. IJECE.

In general, compression methods are developed to reduce redundancy in data. This study takes a different approach: it embeds some bits of one datum of the image data into another datum using a Reversible Low Contrast Mapping (RLCM) transformation. Besides using RLCM for embedding, the method also applies the properties of RLCM to compress a datum before it is embedded. The algorithm uses a queue and recursive indexing, and it encodes the data in a cyclic manner. In contrast to RLCM, the proposed method is a coding method, like Huffman coding. The study evaluates the proposed method on publicly available image data. For all test images, the proposed method achieves a higher compression ratio than Huffman coding.


“…DNA sequence can be very huge. For example, the human genome contains about 3.1647 billion DNA base pairs [1]. Searching patterns in the DNA sequences databases is usually the first and crucial step in DNA related research, such as DNA sequence alignment.…”

Section: Introduction

mentioning

confidence: 99%

“…The major reason is the fact that these methods never consider certain special characteristics of biological sequences. On the contrary, the algorithms, which consider the different regularities or repetition structures that are inherent in DNA sequence, make great success [1]. BIOCOMPRESS [2], GENCOMPRESS [3] and BWT-base [4] compress are the outstanding algorithms.…”

Section: Introduction

mentioning

confidence: 99%

Compressed Pattern Matching in DNA Sequences Using Multithreaded Technology

Lin, Liu, Zhang, et al. 2009. 2009 3rd International Conference on Bioinformatics and Biomedical Engineering.

Compressed pattern matching on large DNA sequence data is very important in bioinformatics. In this paper, multithreaded programming is used to improve performance by searching for patterns in parallel. Two novel multithreaded algorithms are proposed, named MTd-BM and MTd-Horspool. The first is a variant of the d-BM algorithm, which is based on the Boyer-Moore method. The second is designed along the same lines as MTd-BM but uses the Horspool method as its foundation. The experimental results show that both algorithms are nearly 2 times faster than the d-BM algorithm for long DNA patterns (length > 50). Moreover, compression of the DNA sequences gives a guaranteed space saving of 75%.
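The abstract does not say how the 75% figure arises. One common reading, assumed here, is a fixed 2-bit code per base (A, C, G, T) in place of an 8-bit character, which saves 6 of every 8 bits. The sketch below illustrates that arithmetic. The names `pack` and `unpack` and the particular bit layout are hypothetical and not taken from the paper.

```python
# Assumed 2-bit code for the four DNA bases (illustrative choice).
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}
BASES = "ACGT"


def pack(seq: str) -> bytes:
    """Pack a DNA string into 2 bits per base, four bases per byte."""
    out = bytearray((len(seq) + 3) // 4)
    for i, base in enumerate(seq):
        out[i // 4] |= CODE[base] << (2 * (i % 4))
    return bytes(out)


def unpack(data: bytes, n: int) -> str:
    """Recover the first n bases from a buffer produced by pack()."""
    return "".join(BASES[(data[i // 4] >> (2 * (i % 4))) & 3] for i in range(n))


if __name__ == "__main__":
    seq = "ACGTACGT"
    packed = pack(seq)
    # 8 one-byte characters shrink to 2 bytes: a 75% saving.
    saving = 1 - len(packed) / len(seq.encode("ascii"))
    print(len(packed), saving, unpack(packed, len(seq)) == seq)
```

The saving is exactly 75% when the sequence length is a multiple of four. Shorter final groups are padded to a whole byte, and the sequence length has to be stored separately for decoding.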
