2014
DOI: 10.1007/s00778-014-0370-1

ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

Abstract: Modern applications, including bioinformatics, time series, and web log analysis, require the extraction of frequent patterns, called motifs, from one very long (i.e., several gigabytes) sequence. Existing approaches are either heuristics that are error-prone, or exact (also called combinatorial) methods that are extremely slow, therefore, applicable only to very small sequences (i.e., in the order of megabytes). This paper presents ACME, a combinatorial approach that scales to gigabyte-long sequences and is t…

Cited by 13 publications (7 citation statements)
References 27 publications
“…A natural choice to this end is to employ sampling, e.g., as in [13], [14]. However, sampling-based automated profile generation seems to be a particularly challenging task in Spark.…”
Section: Discussion On the Provision Of End-to-end Solutions
Citation type: mentioning
confidence: 99%
“…However, all these cost modeling and profiling techniques do not cover specific phenomena in Spark execution, such as super-linear speed-ups for small degrees of parallelism and performance degradation for large ones. The proposals in [13], [14] present a sampling-based approach to estimate the profile of a single embarrassingly parallel task, based on the behavior of some of its partitions. However, they assume that partitions are scheduled in multiple waves, whereas we have adopted a configuration, where all partitions are scheduled in a single wave but there are multiple interdependent tasks.…”
Section: Related Work
Citation type: mentioning
confidence: 99%
“…We first seek to find all maximal repeated subsequences of states in S′. A maximal subsequence is defined as a sequence that cannot be extended to either the left or right without changing the set of occurrences in S′ [26]. We require that each repeated subsequence has at least L non-overlapping instances in S′.…”
Section: E-step A: Discover Candidate Motifs
Citation type: mentioning
confidence: 99%
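
The definition quoted in this statement can be illustrated with a small brute-force sketch. It is not the method of the citing paper or of ACME; it only shows one reading of the definition. The assumptions are labeled: the function names (`occurrences`, `count_non_overlapping`, `is_maximal`, `maximal_repeats`) are hypothetical, "cannot be extended without changing the set of occurrences" is read as "the occurrences are not all preceded by the same symbol and not all followed by the same symbol", and the cubic cost makes the sketch suitable only for short sequences.

```python
from collections import defaultdict

_BOUNDARY = object()  # sentinel marking "before the start" / "after the end" (assumption)


def occurrences(seq, min_len=2):
    """Map every contiguous subsequence of length >= min_len to its start indices."""
    occ = defaultdict(list)
    n = len(seq)
    for i in range(n):
        for j in range(i + min_len, n + 1):
            occ[tuple(seq[i:j])].append(i)  # start indices end up in ascending order
    return occ


def count_non_overlapping(starts, length):
    """Greedy left-to-right count of occurrences that do not overlap each other."""
    count, next_free = 0, 0
    for s in starts:
        if s >= next_free:
            count += 1
            next_free = s + length
    return count


def is_maximal(seq, starts, length):
    """True if extending left or right by one symbol would lose some occurrence."""
    n = len(seq)
    left = {seq[s - 1] if s > 0 else _BOUNDARY for s in starts}
    right = {seq[s + length] if s + length < n else _BOUNDARY for s in starts}
    return len(left) > 1 and len(right) > 1


def maximal_repeats(seq, min_instances, min_len=2):
    """Maximal repeated subsequences with at least min_instances non-overlapping instances."""
    result = {}
    for sub, starts in occurrences(seq, min_len).items():
        enough = count_non_overlapping(starts, len(sub)) >= min_instances
        if enough and is_maximal(seq, starts, len(sub)):
            result[sub] = starts
    return result


print(maximal_repeats(list("abcxabcyabc"), min_instances=2))
# {('a', 'b', 'c'): [0, 4, 8]}
```

In the example, `ab` and `bc` also repeat three times, but every `ab` is followed by `c` and every `bc` is preceded by `a`, so only `abc` is reported as maximal. Here `min_instances` plays the role of L in the quoted statement.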
“…Motif discovery is a common problem in time series data analysis [6]. Methods for finding motifs include random projection [4] and suffix arrays [26]. Some of these methods are event rather than numerically based and thus bypass the simultaneous problem of state assignment [23].…”
Section: Introduction
Citation type: mentioning
confidence: 99%