Constrained sequential pattern mining aims at identifying frequent patterns on a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency.
Pattern mining is an essential part of knowledge discovery and data analytics. It is a powerful paradigm, especially when combined with constraint reasoning. In this paper, we present Seq2Pat, a constraint-based sequential pattern mining tool with a high-level declarative user interface. The library finds patterns that frequently occur in large sequence databases subject to constraints. We highlight key benefits that are desirable, especially in industrial settings where scalability, explainability, rapid experimentation, reusability, and reproducibility are of great interest. We then showcase an automated feature extraction process powered by Seq2Pat to discover high-level insights and boost downstream machine learning models for customer intent prediction.
This paper develops an exact solution algorithm for the multiple sequence alignment (MSA) problem. In the first step, we design a dynamic programming model and use it to construct a novel multivalued decision diagram (MDD) representation of all pairwise sequence alignments (PSA). PSA MDDs are then synchronized using side constraints to model the MSA problem as a mixed-integer program (MIP), for the first time, in polynomial space complexity. Two bound-based filtering procedures are developed to reduce the size of the MDDs, and the resulting MIP is solved using logic-based Benders decomposition. For a more effective algorithm, we develop a two-phase solution approach. In the first phase, we use optimistic filtering to quickly obtain a near-optimal bound, which we then use for exact filtering in the second phase to prove or obtain an optimal solution. Numerical results on benchmark instances show that our algorithm solves several instances to optimality for the first time, and, in case optimality cannot be proven, considerably improves upon a state-of-the-art heuristic MSA solver. Comparison with an existing state-of-the-art exact MSA algorithm shows that our approach is more time efficient and yields significantly smaller optimality gaps.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.