Chris Armen scite author profile

Stein

1995

Abstract. Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring α of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, α contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. This problem is NP-hard [6] and has applications in computational biology and data compression. The first O(1)-approximation algorithms were given in [2]. We describe our 2 3/4-approximation algorithm, which is the best known. While our algorithm is not complex, our analysis requires some novel machinery to describe overlapping periodic strings. We then show how to combine our result with that of [11] to obtain a ratio of approximately 2.725. We describe an implementation of our algorithm which runs in O(|S| + n^3) time; this matches the running time of previous O(1)-approximations.
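The definition in the abstract above can be made concrete with a small sketch. The Python below is illustrative only and is not the authors' algorithm: it checks the superstring property directly and finds a shortest superstring by exhaustive search over orderings, which is feasible only for a handful of strings. The function names (is_superstring, overlap, brute_force_shortest) are hypothetical and chosen for this example.

```python
from itertools import permutations


def is_superstring(candidate, strings):
    """Return True if every string occurs in candidate as a contiguous block."""
    return all(s in candidate for s in strings)


def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def brute_force_shortest(strings):
    """Exhaustive search over orderings; exponential time, tiny inputs only."""
    # Remove duplicates and strings contained in another string:
    # they are covered automatically and never change the optimum.
    unique = list(dict.fromkeys(strings))
    reduced = [s for s in unique if not any(s != t and s in t for t in unique)]
    if not reduced:
        return ""
    best = None
    for order in permutations(reduced):
        merged = order[0]
        for nxt in order[1:]:
            # Append only the part of nxt not already covered by the overlap.
            merged += nxt[overlap(merged, nxt):]
        if best is None or len(merged) < len(best):
            best = merged
    return best


if __name__ == "__main__":
    S = ["abc", "bcd", "cde"]
    alpha = brute_force_shortest(S)
    print(alpha, is_superstring(alpha, S))
```

For S = {abc, bcd, cde} the search returns abcde, which has length 5 where plain concatenation would have length 9.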

Journal of Computational Biology

Short Superstrings and the Structure of Overlapping Strings

Armen,

Stein

1995

Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring α of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, α contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. The shortest superstring problem has applications in both computational biology and data compression. The shortest superstring problem is NP-hard (Gallant et al., 1980); in fact, it was recently shown to be MAX SNP-hard (Blum et al., 1994). Given the importance of the applications, several heuristics and approximation algorithms have been proposed. Constant factor approximation algorithms have been given in Blum et al. (1994) (factor of 3), Teng and Yao (1993) (factor of 2 8/9), Czumaj et al. (1994) (factor of 2 5/6), and Kosaraju et al. (1994) (factor of 2 50/63). Informally, the key to any algorithm for the shortest superstring problem is to identify sets of strings with large amounts of similarity, or overlap. Although the previous algorithms and their analyses have grown increasingly sophisticated, they reveal remarkably little about the structure of strings with large amounts of overlap. In this sense, they are solving a more general problem than the one at hand. In this paper, we study the structure of strings with large amounts of overlap and use our understanding to give an algorithm that finds a superstring whose length is no more than 2 3/4 times that of the optimal superstring. Our algorithm runs in O(|S| + n^3) time, which matches that of previous algorithms. We prove several interesting properties about short periodic strings, allowing us to answer questions of the following form: Given a string with some periodic structure, characterize all the possible periodic strings that can have a large amount of overlap with the first string.
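To illustrate what "overlap" means operationally, the following Python sketch implements the classic greedy merging heuristic: repeatedly merge the two strings with the largest overlap. This is a standard textbook heuristic shown here for illustration; it is not the 2 3/4-approximation algorithm described in the abstract, and the names overlap and greedy_superstring are hypothetical.

```python
def overlap(a, b):
    """Length of the longest proper suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)) - 1, 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_superstring(strings):
    """Greedy heuristic: merge the pair with maximum overlap until one string remains."""
    # Drop duplicates and strings that already occur inside another string.
    pool = list(dict.fromkeys(strings))
    pool = [s for s in pool if not any(s != t and s in t for t in pool)]
    while len(pool) > 1:
        best_k, best_i, best_j = -1, 0, 1
        # Examine every ordered pair (a, b) and remember the largest overlap.
        for i, a in enumerate(pool):
            for j, b in enumerate(pool):
                if i != j:
                    k = overlap(a, b)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        merged = pool[best_i] + pool[best_j][best_k:]
        pool = [s for idx, s in enumerate(pool) if idx not in (best_i, best_j)]
        pool.append(merged)
    return pool[0] if pool else ""


if __name__ == "__main__":
    print(greedy_superstring(["abc", "bcd", "cde"]))
```

The quadratic pair scan is kept for readability and makes no attempt to match the O(|S| + n^3) running time quoted above.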

A 2 2/3-approximation algorithm for the shortest superstring problem

Stein

1996

Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring α of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, α contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3/4-approximation (WADS '95). We present our new algorithm, G-ShortString, which achieves a ratio of 2 2/3. It generalizes the ShortString algorithm, but the analysis differs substantially from that of ShortString. Our previous work identified classes of strings that have a nested periodic structure, and which must be present in the worst case for our algorithms. We introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.
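As background for the phrase "periodic structure", the sketch below computes the smallest period of a string using the standard prefix function from Knuth-Morris-Pratt string matching. It illustrates only the basic notion that a string with a long overlap with itself has a short period; it does not reproduce the machinery of the paper, and the function names are hypothetical.

```python
def prefix_function(s):
    """pi[i] = length of the longest proper prefix of s[:i+1] that is also its suffix."""
    pi = [0] * len(s)
    k = 0
    for i in range(1, len(s)):
        # Fall back to shorter borders until the next character matches.
        while k > 0 and s[i] != s[k]:
            k = pi[k - 1]
        if s[i] == s[k]:
            k += 1
        pi[i] = k
    return pi


def smallest_period(s):
    """Smallest p >= 1 with s[i] == s[i + p] for all valid i (0 for the empty string)."""
    if not s:
        return 0
    # The longest self-overlap (border) of s has length pi[-1];
    # shifting s by len(s) - pi[-1] positions aligns it with itself.
    return len(s) - prefix_function(s)[-1]


if __name__ == "__main__":
    for word in ["abcabcab", "aaaa", "abcd"]:
        print(word, smallest_period(word))
```

Here abcabcab overlaps itself in 5 characters (abcab), so its smallest period is 8 - 5 = 3.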

A 2 2/3 superstring approximation algorithm

Discrete Applied Mathematics

Stein

1998

Approximation algorithms for the shortest superstring problem.

Given a collection of strings S = {s_1, ..., s_n} over an alphabet Σ, a superstring α of S is a string containing each s_i as a substring; that is, for each i, 1 ≤ i ≤ n, α contains a block of |s_i| consecutive characters that match s_i exactly. The shortest superstring problem is the problem of finding a superstring α of minimum length. The shortest superstring problem has applications in both data compression and computational biology. In data compression, the problem is a part of a general model of string compression proposed by Gallant, Maier and Storer (JCSS '80). Much of the recent interest in the problem is due to its application to DNA sequence assembly. The problem has been shown to be NP-hard; in fact, it was shown by Blum et al. (JACM '94) to be MAX SNP-hard. The first O(1)-approximation was also due to Blum et al., who gave an algorithm that always returns a superstring no more than 3 times the length of an optimal solution. Several researchers have published results that improve on the approximation ratio; of these, the best previous result is our algorithm ShortString, which achieves a 2 3/4-approximation (WADS '95). We present our new algorithm, G-ShortString, which achieves a ratio of 2 2/3. It generalizes the ShortString algorithm, but the analysis differs substantially from that of ShortString. Our previous work identified classes of strings that have a nested periodic structure, and which must be present in the worst case for our algorithms. We introduced machinery to describe these strings and proved strong structural properties about them. In this paper we extend this study to strings that exhibit a more relaxed form of the same structure, and we use this understanding to obtain our improved result.