Kaustubh Beedkar scite author profile

Gemulla

2016

Abstract-Frequent sequence mining methods often make use of constraints to control which subsequences should be mined; e.g., length, gap, span, regular-expression, and hierarchy constraints. We show that many subsequence constraints-including and beyond those considered in the literature-can be unified in a single framework. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners.

A Unified Framework for Frequent Sequence Mining with Subsequence Constraints

ACM Trans. Database Syst.

Gemulla

Martens

2019

Frequent sequence mining methods often make use of constraints to control which subsequences should be mined. A variety of such subsequence constraints has been studied in the literature, including length, gap, span, regular-expression, and hierarchy constraints. In this article, we show that many subsequence constraints-including and beyond those considered in the literature-can be unified in a single framework. A unified treatment allows researchers to study jointly many types of subsequence constraints (instead of each one individually) and helps to improve usability of pattern mining systems for practitioners. In more detail, we propose a set of simple and intuitive "pattern expressions" to describe subsequence constraints and explore algorithms for efficiently mining frequent subsequences under such general constraints. Our algorithms translate pattern expressions to succinct finite-state transducers, which we use as computational model, and simulate these transducers in a way suitable for frequent sequence mining. Our experimental study on real-world datasets indicates that our algorithms-although more general-are efficient and, when used for sequence mining with prior constraints studied in literature, competitive to (and in some cases superior to) state-of-the-art specialized methods.

Compliant Geo-distributed Query Processing

Quiané-Ruiz

Markl

2021

Lash

Gemulla

2015

We propose LASH, a scalable, distributed algorithm for mining sequential patterns in the presence of hierarchies. LASH takes as input a collection of sequences, each composed of items from some application-specific vocabulary. In contrast to traditional approaches to sequence mining, the items in the vocabulary are arranged in a hierarchy: both input sequences and sequential patterns may consist of items from different levels of the hierarchy. Such hierarchies naturally occur in a number of applications including mining natural-language text, customer transactions, error logs, or event sequences. LASH is the first parallel algorithm for mining frequent sequences with hierarchies; it is designed to scale to very large datasets. At its heart, LASH partitions the data using a novel, hierarchy-aware variant of item-based partitioning and subsequently mines each partition independently and in parallel using a customized mining algorithm called pivot sequence miner. LASH is amenable to a MapReduce implementation; we propose effective and efficient algorithms for both the construction and the actual mining of partitions. Our experimental study on large real-world datasets suggest good scalability and run-time efficiency.

Compliant geo-distributed data processing in action

et al. 2021

In this paper we present our work on compliant geo-distributed data processing. Our work focuses on the new dimension of dataflow constraints that regulate the movement of data across geographical or institutional borders. For example, European directives may regulate transferring only certain information fields (such as non personal information) or aggregated data. Thus, it is crucial for distributed data processing frameworks to consider compliance with respect to dataflow constraints derived from these regulations. We have developed a compliance-based data processing framework, which (i) allows for the declarative specification of dataflow constraints, (ii) determines if a query can be translated into a compliant distributed query execution plan, and (iii) executes the compliant plan over distributed SQL databases. We demonstrate our framework using a geo-distributed adaptation of the TPC-H benchmark data. Our framework provides an interactive dashboard, which allows users to specify dataflow constraints, and analyze and execute compliant distributed query execution plans.