ACME: A scalable parallel system for extracting frequent patterns from a very long sequence

Sahli, Majed; Mansour, Essam; Kalnis, Panos

doi:10.1007/s00778-014-0370-1

Cited by 13 publications

(7 citation statements)

References 27 publications

“…A natural choice to this end is to employ sampling, e.g., as in [13], [14]. However, sampling-based automated profile generation seems to be a particularly challenging task in Spark.…”

Section: Discussion On the Provision Of End-to-end Solutions (mentioning)

confidence: 99%

“…However, all these cost modeling and profiling techniques do not cover specific phenomena in Spark execution, such as super-linear speed-ups for small degrees of parallelism and performance degradation for large ones. The proposals in [13], [14] present a sampling-based approach to estimate the profile of a single embarrassingly parallel task, based on the behavior of some of its partitions. However, they assume that partitions are scheduled in multiple waves, whereas we have adopted a configuration, where all partitions are scheduled in a single wave but there are multiple interdependent tasks.…”

Section: Related Work (mentioning)

confidence: 99%

Dynamic Configuration of Partitioning in Spark Applications

Gounaris, Kougka, Tous, et al. 2017

IEEE Trans. Parallel Distrib. Syst.

Abstract: Spark has become one of the main options for large-scale analytics running on top of shared-nothing clusters. This work aims to make a deep dive into the parallelism configuration and shed light on the behavior of parallel Spark jobs. It is motivated by the fact that running a Spark application on all the available processors does not necessarily imply lower running time, while it may entail a waste of resources. We first propose analytical models for expressing the running time as a function of the number of machines employed. We then take another step, namely to present novel algorithms for configuring dynamic partitioning with a view to minimizing resource consumption without sacrificing running time beyond a user-defined limit. The problem we target is NP-hard. To tackle it, we propose a greedy approach after introducing the notions of dependency graphs and of the benefit from modifying the degree of partitioning at a stage; complementarily, we investigate a randomized approach. Our polynomial solutions are capable of judiciously using the resources that are potentially at the user's disposal and strike interesting trade-offs between running time and resource consumption. Their efficiency is thoroughly investigated through experiments based on real execution data.

“…We first seek to find all maximal repeated subsequences of states in S′. A maximal subsequence is defined as a sequence that cannot be extended to either the left or right without changing the set of occurrences in S′ [26]. We require that each repeated subsequence has at least L non-overlapping instances in S′.…”

Section: E-step A: Discover Candidate Motifs (mentioning)

confidence: 99%

“…Motif discovery is a common problem in time series data analysis [6]. Methods for finding motifs include random projection [4] and suffix arrays [26]. Some of these methods are event rather than numerically based and thus bypass the simultaneous problem of state assignment [23].…”

Section: Introduction (mentioning)

confidence: 99%

MASA: Motif-Aware State Assignment in Noisy Time Series Data

Jain, Hallac, Sosic, et al. 2018

Preprint

Complex systems, such as airplanes, cars, or financial markets, produce multivariate time series data consisting of a large number of system measurements over a period of time. Such data can be interpreted as a sequence of states, where each state represents a prototype of system behavior. An important problem in this domain is to identify repeated sequences of states, known as motifs. Such motifs correspond to complex behaviors that capture common sequences of state transitions. For example, in automotive data, a motif of "making a turn" might manifest as a sequence of states: slowing down, turning the wheel, and then speeding back up. However, discovering these motifs is challenging, because the individual states and state assignments are unknown, have different durations, and need to be jointly learned from the noisy time series. Here we develop motif-aware state assignment (MASA), a method to discover common motifs in noisy time series data and leverage those motifs to more robustly assign states to measurements. We formulate the problem of motif discovery as a large optimization problem, which we solve using an expectation-maximization type approach. MASA performs well in the presence of noise in the input data and is scalable to very large datasets. Experiments on synthetic data show that MASA outperforms state-of-the-art baselines by up to 38.2%, and two case studies demonstrate how our approach discovers insightful motifs in the presence of noise in real-world time series data.

CCS Concepts: Computing methodologies → Cluster analysis; Motif discovery; Unsupervised learning.

Sequence Repeats

Erciyes 2015

Computational Biology
