The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. In this paper we study the problem of computing the edit distance between a pair of strings whose distance is bounded by a parameter k ≪ n. We present two streaming algorithms for computing edit distance: one runs in time O(n + k^2) and the other in time n + O(k^3). By writing n + O(k^3) we want to emphasize that the number of operations per input symbol is a small constant. In particular, the running time does not depend on the alphabet size, and the algorithm should be easy to implement. Previously, a streaming algorithm with running time O(n + k^4) was given in the paper by the current authors (STOC'16). The best off-line algorithm runs in time O(n + k^2) (Landau et al., 1998), which is known to be optimal under the Strong Exponential Time Hypothesis. ogy, pattern recognition, text processing, information retrieval and many more. The edit distance between x and y, denoted by ∆(x, y), is defined as the minimum number of character insertions, deletions, and substitutions needed for converting x into y. Due to its immense applicability, the problem of computing the edit distance between two given strings x, y ∈ Σ^n is of prime interest to researchers in various domains of computer science. Sometimes one also requires that the algorithm finds an alignment of x and y, i.e., a series of edit operations that transform x into y. In this paper we study the problem of computing the edit distance of strings when given an a priori upper bound k ≪ n on their distance. This is akin to fixed parameter tractability.
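To illustrate how an a priori bound k can be exploited, here is a minimal sketch of the textbook banded dynamic program, which only fills table cells within k of the main diagonal and so runs in roughly O(n · k) time. This is not the O(n + k^2) or n + O(k^3) streaming algorithm described above; the function name `bounded_edit_distance` and its return convention (`None` when the distance exceeds k) are assumptions made for this example.

```python
from typing import Optional


def bounded_edit_distance(x: str, y: str, k: int) -> Optional[int]:
    """Return the edit distance of x and y if it is at most k, else None.

    Illustrative banded DP: only cells (i, j) with |i - j| <= k are filled,
    because any alignment leaving that band already costs more than k.
    """
    n, m = len(x), len(y)
    if abs(n - m) > k:
        return None  # the length difference alone forces more than k edits
    INF = k + 1  # stands for "more than k"; exact values above k are not needed
    # prev[j] = distance between x[:i-1] and y[:j], stored only inside the band
    prev = {j: j for j in range(0, min(m, k) + 1)}
    for i in range(1, n + 1):
        curr = {}
        lo, hi = max(0, i - k), min(m, i + k)
        for j in range(lo, hi + 1):
            if j == 0:
                best = i  # delete the first i characters of x
            else:
                best = min(
                    prev.get(j, INF) + 1,      # delete x[i-1]
                    curr.get(j - 1, INF) + 1,  # insert y[j-1]
                    prev.get(j - 1, INF) + (x[i - 1] != y[j - 1]),  # match/substitute
                )
            curr[j] = min(best, INF)
        prev = curr
    d = prev.get(m, INF)
    return d if d <= k else None
```

For example, `bounded_edit_distance("kitten", "sitting", 3)` reports the distance, while the same call with `k=2` returns `None` because the strings are more than two edits apart.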
Arguably, the case when the edit distance is small relative to the length of the strings is the most interesting, since when comparing two strings with respect to their edit distance we are implicitly assuming that the strings are similar. If they are not similar, the edit distance is uninformative. There are a few exceptions to this rule, most notably the reduction of instances of formula satisfiability (SAT) to instances of edit distance of exponentially large strings [BI15], where the edit distance of the resulting strings is close to their length. However, such instances of the edit distance problem are rather artificial. For typical applications the edit distance of the two strings is much smaller than the length of the strings. Consider for example copying DNA during cell division: human DNA is essentially a string of about 10^9 letters from {A, C, G, T}, and due to imperfections in the copying mechanism one can expect about 50 edit operations to occur during the process. So in many applications we can be looking for a handful of edit operations in large strings. Landau et al. [LMS98] provided an algorithm that runs in time O(n + k^2) and uses space O(n) when the size of the alphabet Σ is constant. In general the running time of the algorithm given in [LMS98] is O(n · mi...
Edit distance is a measure of similarity of two strings based on the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. The edit distance can be computed exactly using a dynamic programming algorithm that runs in quadratic time. Andoni, Krauthgamer and Onak (2010) gave a nearly linear time algorithm that approximates edit distance within approximation factor poly(log n). In this paper, we provide an algorithm with running time O(n^(2−2/7)) that approximates the edit distance within a constant factor.
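The quadratic-time dynamic program mentioned above can be sketched as follows. This is the standard textbook recurrence, kept to two rows of the table; the function name `edit_distance` is chosen for this example only.

```python
def edit_distance(x: str, y: str) -> int:
    """Exact edit distance in O(len(x) * len(y)) time and O(len(y)) space."""
    prev = list(range(len(y) + 1))  # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]
```

Each cell depends only on its left, upper, and upper-left neighbours, which is why keeping the previous row is enough.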
The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic program computes the edit distance between two strings of length n in O(n^2) time, and a more sophisticated algorithm runs in time O(n + t^2) when the edit distance is t [Landau, Myers and Schmidt, SICOMP 1998]. In pursuit of faster running times, the last couple of decades have seen a flurry of research on approximating edit distance, including polylogarithmic approximation in near-linear time [Andoni, Krauthgamer and Onak, FOCS 2010], and a constant-factor approximation in subquadratic time [Chakrabarty, Das, Goldenberg, Koucký and Saks, FOCS 2018]. We study sublinear-time algorithms for small edit distance, which was investigated extensively because of its numerous applications. Our main result is an algorithm for distinguishing whether the edit distance is at most t or at least t^2 (the quadratic gap problem) in time Õ(n/t + t^3). This time bound is sublinear roughly for all t in [ω(1), o(n^(1/3))], which was not known before. The best previous algorithms solve this problem in sublinear time only for t = ω(n^(1/3)) [Andoni and Onak, STOC 2009]. Our algorithm is based on a new approach that adaptively switches between uniform sampling and reading contiguous blocks of the input strings. In contrast, all previous algorithms choose which coordinates to query non-adaptively. Moreover, it can be extended to solve the t vs t^(2−ε) gap problem in time Õ(n/t^(1−ε) + t^3). Previous Work. Batu et al.'s algorithm distinguishes t = n^α vs f(t) = Ω(n) in O(n^max{2α−1, α/2}) time for any fixed α < 1 [BEK+03]. Their approach crucially depends on f(t) = Ω(n) and cannot distinguish between (say) n^0.1 and n^0.99.
The best sublinear-time algorithm known for gap edit distance, by Andoni and Onak [AO09], distinguishes between t = n^α vs f(t) = n^β for β > α in time O(n^(2+α−2β+o(1))). For the quadratic gap problem, i.e., β = 2α, this time bound is O(n^(2−3α+o(1))), which becomes worse as t gets smaller (as discussed earlier). For example, when t = n^(1/4), the known algorithm is not sublinear, whereas ours runs in time Õ(n^(3/4)). The presence of repeated patterns makes the gap edit distance problem significantly more difficult to approximate. When no repetition is allowed, the state-of-the-art sublinear-time algorithms of [AN10] (Footnote 1: Throughout, the tilde notation Õ(·) and ω̃(·) hide factors that are polylogarithmic in n.)
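The exponent comparison for the quadratic gap problem can be checked with a few lines of exact arithmetic. The sketch below writes t = n^α and returns the exponent of n in each stated time bound, ignoring polylogarithmic and n^(o(1)) factors; the function names are hypothetical and only encode the two formulas quoted in the text.

```python
from fractions import Fraction


def exponent_new_bound(alpha: Fraction) -> Fraction:
    """Exponent of n in n/t + t^3 when t = n^alpha (polylog factors ignored)."""
    return max(1 - alpha, 3 * alpha)


def exponent_andoni_onak(alpha: Fraction) -> Fraction:
    """Exponent of n in n^(2 + alpha - 2*beta) with beta = 2*alpha (o(1) ignored)."""
    return 2 - 3 * alpha


alpha = Fraction(1, 4)  # the example t = n^(1/4) from the text
new_exp = exponent_new_bound(alpha)
old_exp = exponent_andoni_onak(alpha)
print(f"new bound: n^{new_exp}, [AO09] bound: n^{old_exp}")
```

A bound is sublinear exactly when its exponent is below 1, so for α = 1/4 the first exponent qualifies and the second does not, and both exponents reach 1 at α = 1/3.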
We study edit distance computation with preprocessing: the preprocessing algorithm acts on each string separately, and then the query algorithm takes as input the two preprocessed strings. This model is inspired by scenarios where we would like to compute edit distance between many pairs in the same pool of strings. Our results include:
Permutation-LCS: If the LCS between two permutations has length n − k, we can compute it exactly with O(n log(n)) preprocessing and O(k log(n)) query time.
Small edit distance: For general strings, if their edit distance is at most k, we can compute it exactly with O(n log(n)) preprocessing and O(k^2 log(n)) query time.
Approximate edit distance: For the most general input, we can approximate the edit distance to within factor (7+o(1)) with preprocessing time Õ(n^2) and query time Õ(n^(1.5+o(1))).
All of these results significantly improve over the state of the art in edit distance computation without preprocessing. Interestingly, by combining ideas from our algorithms with preprocessing, we provide new improved results for approximating edit distance without preprocessing in subquadratic time.
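As background for the Permutation-LCS result, the LCS of two permutations of the same symbols can be computed without any preprocessing in O(n log n) time by the classical reduction to the longest increasing subsequence. The sketch below shows that baseline only; it is not the preprocessing scheme with O(k log(n)) query time described above, and the function name `permutation_lcs_length` is an assumption for this example.

```python
from bisect import bisect_left
from typing import Hashable, Sequence


def permutation_lcs_length(p: Sequence[Hashable], q: Sequence[Hashable]) -> int:
    """Length of the LCS of two permutations of the same symbol set."""
    position_in_p = {symbol: i for i, symbol in enumerate(p)}
    # Rewrite q as positions in p: a common subsequence of p and q
    # corresponds exactly to an increasing subsequence of `mapped`.
    mapped = [position_in_p[symbol] for symbol in q]
    # tails[l] = smallest last value of any increasing subsequence of length l + 1
    tails = []
    for value in mapped:
        slot = bisect_left(tails, value)
        if slot == len(tails):
            tails.append(value)   # extends the longest subsequence found so far
        else:
            tails[slot] = value   # keeps a smaller tail for length slot + 1
    return len(tails)
```

If the returned length is n − k, then k is the number of symbols left out of the common subsequence, which is the parameter in the Permutation-LCS statement.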