ALPINE: Progressive Itemset Mining with Definite Guarantees

Hu, Qiong; Imieliński, Tomasz

doi:10.1137/1.9781611974973.8

Cited by 7 publications

(6 citation statements)

References 12 publications

“…Since survey [31] mentioned frequent itemset mining (FIM) as a tool to identify strong associations between allelic combinations associated with diseases, the proposed algorithm needs further comparison with other approaches from FIM like DeBi [32] and anytime discovery approaches like Alpine [33] tested on GEA datasets as well; though their use may get complicated if we need to keep information about object names for decision-makers. It also requires further time complexity improvements to increase the scalability and quality of the extensive bicluster finding process for massive datasets.…”

Section: Results (mentioning)

confidence: 99%

Object-Attribute Biclustering for Elimination of Missing Genotypes in Ischemic Stroke Genome-Wide Data

Ignatov

Khvorykh

Khrunin

et al. 2020

Preprint

Missing genotypes can affect the efficacy of machine learning approaches to identify the risk genetic variants of common diseases and traits. The problem occurs when genotypic data are collected from different experiments with different DNA microarrays, each being characterised by its pattern of uncalled (missing) genotypes. This can prevent the machine learning classifier from assigning the classes correctly. To tackle this issue, we used well-developed notions of object-attribute biclusters and formal concepts that correspond to dense subrelations in the binary relation patients x SNPs. The paper contains experimental results on applying a biclustering algorithm to a large real-world dataset collected for studying the genetic bases of ischemic stroke. The algorithm could identify large dense biclusters in the genotypic matrix for further processing, which in return significantly improved the quality of machine learning classifiers. The proposed algorithm was also able to generate biclusters for the whole dataset without size constraints in comparison to the In-Close4 algorithm for generation of formal concepts.

“…Here, we clarify our contribution through the comparison with the related works. There are many previous works (Xin et al 2005; Cheng et al 2006, 2008; Song et al 2007; Boley et al 2009, 2010; Liu et al 2012; Quadrana et al 2015; Hu and Imielinski 2017) on PC mining that deal with lossy condensed representations of the FIs. However, most of those previous works, except for Song et al (2007), Cheng et al (2008), and Quadrana et al (2015), are oriented to a transactional database that tolerates multiple scanning and allows us to assume a stable distribution of occurrences.…”

Section: Related Work (mentioning)

confidence: 99%

PARASOL: a hybrid approximation approach for scalable frequent itemset mining in streaming data

2019

Here, we present a novel algorithm for frequent itemset mining in streaming data (FIM-SD). For the past decade, various FIM-SD methods in one-pass approximation settings that allow to approximate the support of each itemset have been proposed. They can be categorized into two approximation types: parameter-constrained (PC) mining and resource-constrained (RC) mining. PC methods control the maximum error that can be included in the approximate support based on a pre-defined parameter. In contrast, RC methods limit the maximum memory consumption based on resource constraints. However, the existing PC methods can exponentially increase the memory consumption, while the existing RC methods can rapidly increase the maximum error. In this study, we address this problem by introducing a hybrid approach of PC-RC approximations, called PARASOL. For any streaming data, PARASOL ensures to provide a condensed representation, called a Δ-covered set, which is regarded as an extension of the closedness compression; when Δ = 0, the solution corresponds to the ordinary closed itemsets. PARASOL searches for such approximate closed itemsets that can restore the frequent itemsets and their supports while the maximum error is bounded by an integer, Δ. Then, we empirically demonstrate that the proposed algorithm significantly outperforms the state-of-the-art PC and RC methods for FIM-SD.

“…the corresponding closed itemset. Nowadays, there exist very efficient algorithms for computing frequent closed itemsets (Hu and Imielinski, 2017; Uno et al, 2005). Even for a low frequency threshold, they are able to efficiently generate an exponential number of closed itemsets.…”

Section: Introduction (mentioning)

confidence: 99%

Discovery data topology with the closure structure. Theoretical and practical aspects

Makhalova

Buzmakov

Kuznetsov

et al. 2020

Preprint

In this paper, we are revisiting pattern mining and especially itemset mining, which allows one to analyze binary datasets in searching for interesting and meaningful association rules and respective itemsets in an unsupervised way. While a summarization of a dataset based on a set of patterns does not provide a general and satisfying view over a dataset, we introduce a concise representation, the closure structure, based on closed itemsets and their minimum generators, for capturing the intrinsic content of a dataset. The closure structure allows one to understand the topology of the dataset in the whole and the inherent complexity of the data. We propose a formalization of the closure structure in terms of Formal Concept Analysis, which is well adapted to study this data topology. We present and demonstrate theoretical results, and as well, practical results using the GDPM algorithm. GDPM is rather unique in its functionality as it returns a characterization of the topology of a dataset in terms of complexity levels, highlighting the diversity and the distribution of the itemsets. Finally a series of experiments shows how GDPM can be practically used and what can be expected from the output.
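
For readers unfamiliar with the notion of a closed itemset that the citing abstracts rely on, the following is a minimal brute-force sketch in Python. It is not ALPINE, GDPM, or any algorithm from the cited papers; the toy dataset and the function names (`cover`, `closure`, `closed_itemsets`) are assumptions made purely for illustration. An itemset is treated as closed when it equals the intersection of all transactions that contain it.

```python
from itertools import combinations

# Toy transactional dataset (hypothetical, for illustration only).
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
    {"c", "d"},
]

def cover(itemset, transactions):
    """Return the transactions that contain every item of `itemset`."""
    return [t for t in transactions if itemset <= t]

def closure(itemset, transactions):
    """Intersect all transactions containing `itemset`; None if it never occurs."""
    covering = cover(itemset, transactions)
    if not covering:
        return None
    return frozenset(set.intersection(*covering))

def closed_itemsets(transactions, min_support=1):
    """Brute-force map from each closed itemset to its support.

    Enumerates every subset of the item universe, so it is only
    suitable for tiny examples.
    """
    items = sorted(set().union(*transactions))
    closed = {}
    for size in range(len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            support = len(cover(candidate, transactions))
            if support >= min_support and closure(candidate, transactions) == candidate:
                closed[candidate] = support
    return closed

if __name__ == "__main__":
    for itemset, support in closed_itemsets(TRANSACTIONS, min_support=2).items():
        print(sorted(itemset), support)
```

In this toy data, `{a}` and `{b}` are not closed because every transaction containing either one also contains the other, so both have closure `{a, b}`; they act as generators of that closed itemset. The exhaustive enumeration is exponential in the number of items, which is why practical miners use far more careful search strategies.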
