Streaming algorithms for embedding and computing edit distance in the low distance regime

Chakraborty, Diptarka; Goldenberg, Elazar; Koucký, Michal

doi:10.1145/2897518.2897577

Cited by 57 publications

(108 citation statements)

References 32 publications

Mentioning: 108

“…Independently developed fuzzy extractors [9] can also be seen as providing a document exchange scheme for some k polynomially small in n. A randomized scheme by Jowhari [10] independently achieved a size of O(k log n log* n). In two recent breakthroughs, Chakraborty, Goldenberg, and Koucký [11] designed a low-distortion embedding from edit distance to Hamming distance which can be used to get a summary of size Θ(k² log n), and Belazzougui and Zhang [2] further build on this randomized embedding and achieved a scheme with summary size Θ(k log² k + k log n), which is order optimal for k = exp(√(log n)). All of these schemes are randomized.…”

Section: Introduction (mentioning)

confidence: 99%
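The embedding from edit distance to Hamming distance mentioned in the statement above is commonly described as a random walk over the input string. The following is a minimal sketch under that assumption, not the authors' reference implementation: the function names, the pad symbol `#`, the seed handling, and the output length of 3n are illustrative choices that do not come from this page.

```python
import random


def make_walk_bits(alphabet, length, seed):
    """Shared randomness: for each output position, one random bit per symbol.

    Both strings being compared must be embedded with the same table.
    """
    rng = random.Random(seed)
    return [{c: rng.randint(0, 1) for c in alphabet} for _ in range(length)]


def embed_random_walk(x, bits, pad="#"):
    """Copy the current input symbol to the output, then either stay on it
    or advance by one position, depending on the shared random bit."""
    out = []
    i = 0  # current position in x
    for step in bits:
        if i < len(x):
            out.append(x[i])
            i += step[x[i]]  # advance by 0 or 1
        else:
            out.append(pad)  # input exhausted: emit padding
    return "".join(out)


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(1 for p, q in zip(a, b) if p != q)


if __name__ == "__main__":
    n = 8
    bits = make_walk_bits("acgt", 3 * n, seed=2016)
    u = embed_random_walk("gattacag", bits)
    v = embed_random_walk("gatacagt", bits)  # a few edits away from the first string
    print(u, v, hamming(u, v))
```

Because both strings walk with the same bits, identical inputs always produce identical outputs; how far apart the outputs of nearby strings end up depends on the random bits, so the sketch makes no claim about the distortion.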

Optimal Document Exchange and New Codes for Insertions and Deletions

Haeupler

2019

2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)

“…for the Ulam metric (edit distance with no repetition, which obviously requires a large alphabet) distinguish between t vs Θ(t) in O(n/t + √n) time, achieving a bound that is similar to the folklore sampling algorithm for approximating Hamming distance. There is a long line of work on edit distance and related problems, aiming to achieve fast running time [AN10, AIKH13, Sah17, BEG+18, HSSS19], low distortion embedding [OR07, KR06, CGK16b, BZ16], small space complexity [CGK16b, BZ16, BJKK04] and parallel algorithms [HSS19]. The work of Andoni, Onak and Krauthgamer [AKO10] achieves a sublinear asymmetric query complexity for approximating edit distance; however it does not lead to any sublinear time algorithm since one of the strings must be read in its entirety.…”

Section: Results (mentioning)

confidence: 99%

“…Their algorithm computes a constant-size sketch but still requires a linear pass over the data. This result was later improved to hold for general strings [CGK16b] via embedding into Hamming distance, but again in linear time.…”

Section: What Is the Right Gap? (mentioning)

confidence: 98%

Sublinear Algorithms for Gap Edit Distance

Goldenberg

Krauthgamer

Saha

2019

2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS)

Self Cite


The edit distance is a way of quantifying how similar two strings are to one another by counting the minimum number of character insertions, deletions, and substitutions required to transform one string into the other. A simple dynamic program computes the edit distance between two strings of length n in O(n²) time, and a more sophisticated algorithm runs in time O(n + t²) when the edit distance is t [Landau, Myers and Schmidt, SICOMP 1998]. In pursuit of obtaining faster running time, the last couple of decades have seen a flurry of research on approximating edit distance, including polylogarithmic approximation in near-linear time [Andoni, Krauthgamer and Onak, FOCS 2010], and a constant-factor approximation in subquadratic time [Chakraborty, Das, Goldenberg, Koucký and Saks, FOCS 2018].

We study sublinear-time algorithms for small edit distance, which was investigated extensively because of its numerous applications. Our main result is an algorithm for distinguishing whether the edit distance is at most t or at least t² (the quadratic gap problem) in time Õ(n/t + t³). This time bound is sublinear roughly for all t in [ω(1), o(n^(1/3))], which was not known before. The best previous algorithms solve this problem in sublinear time only for t = ω(n^(1/3)) [Andoni and Onak, STOC 2009].

Our algorithm is based on a new approach that adaptively switches between uniform sampling and reading contiguous blocks of the input strings. In contrast, all previous algorithms choose which coordinates to query non-adaptively. Moreover, it can be extended to solve the t vs t^(2−ε) gap problem in time Õ(n/t^(1−ε) + t³).

Previous Work. Batu et al.'s algorithm distinguishes t = n^α vs f(t) = Ω(n) in O(n^(max{2α−1, α/2})) time for any fixed α < 1 [BEK+03]. Their approach crucially depends on f(t) = Ω(n) and cannot distinguish between (say) n^0.1 and n^0.99.
The best sublinear-time algorithm known for gap edit distance, by Andoni and Onak [AO09], distinguishes between t = n^α vs f(t) = n^β for β > α in time O(n^(2+α−2β+o(1))). For the quadratic gap problem, i.e., β = 2α, this time bound is O(n^(2−3α+o(1))), which becomes worse as t gets smaller (as discussed earlier). For example, when t = n^(1/4), the known algorithm is not sublinear, whereas ours runs in time Õ(n^(3/4)). The presence of repeated patterns makes the gap edit distance problem significantly more difficult to approximate. When no repetition is allowed, the state-of-the-art sublinear-time algorithms of [AN10] …

Footnote 1: Throughout, the tilde notation Õ(·) and ω̃(·) hide factors that are polylogarithmic in n.
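To make the "simple dynamic program" and the role of the threshold t concrete, here is a minimal sketch of a banded dynamic program that decides whether the edit distance is at most t. It is a swapped-in textbook technique, not the O(n + t²) algorithm of Landau, Myers and Schmidt and not the sublinear algorithm of this abstract; the function name `bounded_edit_distance` is hypothetical.

```python
def bounded_edit_distance(a, b, t):
    """Return the edit distance of a and b if it is at most t, else None.

    Only cells with |i - j| <= t are filled, because any alignment of cost
    at most t never leaves that diagonal band. Rows are allocated at full
    length for readability; a tuned version would store only the band.
    """
    n, m = len(a), len(b)
    if abs(n - m) > t:
        return None  # the length difference alone already costs more than t
    inf = t + 1  # every value above t is clipped to this sentinel
    prev = [j if j <= t else inf for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [inf] * (m + 1)
        lo, hi = max(0, i - t), min(m, i + t)
        if lo == 0:
            cur[0] = i  # delete the first i characters of a (i <= t here)
        for j in range(max(1, lo), hi + 1):
            substitute = prev[j - 1] + (0 if a[i - 1] == b[j - 1] else 1)
            delete = prev[j] + 1
            insert = cur[j - 1] + 1
            cur[j] = min(substitute, delete, insert, inf)
        prev = cur
    return prev[m] if prev[m] <= t else None


if __name__ == "__main__":
    print(bounded_edit_distance("kitten", "sitting", 3))
    print(bounded_edit_distance("kitten", "sitting", 2))
```

The band has about 2t + 1 cells per row, so the number of cells filled is O(n·t) rather than O(n²), which is the usual first step toward the faster bounds quoted above.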

“…Then, random number u returned from Func F is converted to a random number from the Cauchy distribution in Func H as tan(π · (u − 0.5))/β at line 8. [2], [33] is another string embedding using a randomized algorithm. Let S_i for i = 1, 2, ..., N be input strings of alphabet Σ and let L be the maximum length of input strings.…”

Section: Scalable Alignment Kernels (mentioning)

confidence: 99%
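The conversion quoted in the statement above, tan(π · (u − 0.5))/β, is the standard inverse-CDF transform from a uniform random number to a Cauchy one. A minimal sketch follows; the function name `uniform_to_cauchy` is hypothetical, and it is only an assumption here that u is drawn uniformly from the open interval (0, 1) and that β is a positive parameter, as the quoted formula suggests.

```python
import math
import random


def uniform_to_cauchy(u, beta):
    """Map u in (0, 1) to a Cauchy-distributed value via tan(pi * (u - 0.5)) / beta."""
    if not 0.0 < u < 1.0:
        raise ValueError("u must lie strictly between 0 and 1")
    return math.tan(math.pi * (u - 0.5)) / beta


if __name__ == "__main__":
    rng = random.Random(0)
    samples = [uniform_to_cauchy(rng.random() or 0.5, beta=2.0) for _ in range(5)]
    print(samples)
```

The midpoint u = 0.5 maps to 0, and values of u near 0 or 1 map to very large magnitudes, which reflects the heavy tails of the Cauchy distribution.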

Space-Efficient Feature Maps for String Alignment Kernels

Tabei

Yamanishi

Pagh

2019

2019 IEEE International Conference on Data Mining (ICDM)

String kernels are attractive data analysis tools for analyzing string data. Among them, alignment kernels are known for their high prediction accuracies in string classifications when tested in combination with SVM in various applications. However, alignment kernels have a crucial drawback in that they scale poorly due to their quadratic computational complexity in the number of input strings, which limits large-scale applications in practice. We address this need by presenting the first approximation for string alignment kernels, which we call space-efficient feature maps for edit distance with moves (SFMEDM), by leveraging a metric embedding named edit sensitive parsing (ESP) and feature maps (FMs) of random Fourier features (RFFs) for large-scale string analyses. The original FMs for RFFs consume a huge amount of memory proportional to the dimension d of input vectors and the dimension D of output vectors, which prohibits their large-scale application. We present novel space-efficient feature maps (SFMs) of RFFs for a space reduction from O(dD) of the original FMs to O(d) of SFMs with a theoretical guarantee with respect to concentration bounds. We experimentally test SFMEDM on its ability to learn SVM for large-scale string classifications with various massive string data, and we demonstrate the superior performance of SFMEDM with respect to prediction accuracy, scalability and computation efficiency.
