Sequential pattern mining remains a challenging task due to the large number of redundant candidate patterns and the exponential search space. In addition, further analysis is still required to map extracted patterns to different outcomes. In this paper, we introduce a pattern mining framework that operates on semi-structured datasets and exploits the dichotomy between outcomes. Our approach takes advantage of constraint reasoning to find sequential patterns that occur frequently and exhibit desired properties. This allows the creation of novel pattern embeddings that are useful for knowledge extraction and predictive modeling. Based on dichotomic pattern mining, we present two real-world applications for customer intent prediction and intrusion detection. Overall, our approach plays an integrator role between semi-structured sequential data and machine learning models, improves the performance of the downstream task, and retains interpretability.
Search is one of the most common platforms used to seek information. However, users mostly get overloaded with results whenever they use such a platform to resolve their queries. Nowadays, direct answers to queries are being provided as a part of the search experience. The question-answer (QA) retrieval process plays a significant role in enriching the search experience. Most off-the-shelf Semantic Textual Similarity models work fine for well-formed search queries, but their performances degrade when applied to a domain-specific setting having incomplete or grammatically ill-formed search queries in prevalence. In this paper, we discuss a framework for calculating similarities between a given input query and a set of predefined questions to retrieve the question which matches to it the most. We have used it for the financial domain, but the framework is generalized for any domainspecific search engine and can be used in other domains as well. We use Siamese network [6] over Long Short-Term Memory (LSTM) [3] models to train a classifier which generates unnormalized and normalized similarity scores for a given pair of questions. Moreover, for each of these question pairs, we calculate three other similarity scores: cosine similarity between their average word2vec embeddings [15], cosine similarity between their sentence embeddings [7] generated using RoBERTa [17] and their customized fuzzy-match score. Finally, we develop a metaclassifier using Support Vector Machines [19] for combining these five scores to detect if a given pair of questions is similar. We benchmark our model's performance against existing State Of The Art (SOTA) models on Quora Question Pairs (QQP) dataset 1 as well as a dataset specific to the financial domain. After evaluating its performance on the financial domain specific data, we conclude that it not only outperforms several existing SOTA models on F1 score but also has decent accuracy.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.