In this paper, a new technique named the Maximal Itemset Miner Algorithm (MIMA) is proposed for extracting maximal frequent itemsets from text. The algorithm begins its search by generating the best initial border in the search space, based on the supports of the first-level items that satisfy the user-specified minimum support. Our approach to counting itemset support combines a vertical representation of the data with a queue data structure that stores the itemsets. To reduce the search space, the algorithm applies several pruning conditions to each itemset in the initial border. Experiments were performed on the standard CNN Arabic text dataset, and the proposed method registered lower execution time than the Apriori algorithm on three datasets of different sizes.
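The support-counting idea in this abstract (a vertical representation plus a queue of itemsets) can be illustrated with a minimal sketch. This is a generic breadth-first miner, not MIMA itself: the initial-border generation and the pruning conditions are not described in the abstract and are not reproduced here. The function names and the use of an absolute support count are assumptions made for illustration.

```python
from collections import deque


def vertical_index(transactions):
    """Map each item to the set of transaction ids (tidset) containing it."""
    index = {}
    for tid, items in enumerate(transactions):
        for item in items:
            index.setdefault(item, set()).add(tid)
    return index


def maximal_frequent_itemsets(transactions, min_support):
    """Find maximal frequent itemsets with tidset intersections and a queue.

    min_support is an absolute count (an assumption of this sketch).
    """
    index = vertical_index(transactions)
    # First level: single items that meet the minimum support.
    items = sorted(i for i, tids in index.items() if len(tids) >= min_support)
    queue = deque(((item,), index[item]) for item in items)
    frequent = []
    while queue:
        itemset, tids = queue.popleft()
        frequent.append(frozenset(itemset))
        # Extend only with items ordered after the last one, so each
        # itemset is generated exactly once.
        for item in items:
            if item <= itemset[-1]:
                continue
            new_tids = tids & index[item]  # support = size of intersection
            if len(new_tids) >= min_support:
                queue.append((itemset + (item,), new_tids))
    # An itemset is maximal if no frequent itemset is a proper superset of it.
    return [s for s in frequent if not any(s < t for t in frequent)]


docs = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = maximal_frequent_itemsets(docs, min_support=3)
```

With the toy transactions above, every pair of items occurs in three transactions while the full triple occurs in only two, so the maximal frequent itemsets at a minimum support of 3 are the three pairs.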
Many data mining techniques and machine learning algorithms have been developed to classify textual data, including decision trees, support vector machines, and k-nearest neighbours, in addition to other machine learning-based algorithms. Association rule-based machine learning is accomplished in two phases, a training phase and a testing phase, which may be reinforced with new minimum support and confidence values to enhance classification accuracy. Association rule mining, in its various applications, involves two computationally intensive steps: frequent itemset mining and association rule extraction. This paper presents a general algorithm for association rule-based machine learning dedicated to text classification. To verify the efficiency of the algorithm, different text datasets were used, such as a tweets dataset for sentiment classification, PDF documents, and HTML documents. The sentiment classification experiments showed that the classifier constructed with minsup threshold = %700 and minconf threshold = 50% gives the best performance, with F1 = 0.9861811, while the experiments on HTML and PDF documents achieved a classification accuracy of 94%.
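To make the roles of minimum support and minimum confidence concrete, the following is a minimal sketch of class association rules for text. It is not the algorithm of the paper, whose details the abstract does not give. The rule form (token set implies label), the `max_len` cap on antecedent size, the brute-force enumeration, and the tie-breaking by confidence and then support are all assumptions made for illustration.

```python
from itertools import combinations


def mine_class_rules(docs, labels, min_sup, min_conf, max_len=2):
    """Return rules as (antecedent, label, support, confidence).

    support    = documents containing the antecedent with this label / all documents
    confidence = documents containing the antecedent with this label
                 / documents containing the antecedent
    """
    n = len(docs)
    ante_count = {}
    rule_count = {}
    for tokens, label in zip(docs, labels):
        unique = sorted(set(tokens))
        for k in range(1, max_len + 1):
            for ante in combinations(unique, k):
                ante_count[ante] = ante_count.get(ante, 0) + 1
                key = (ante, label)
                rule_count[key] = rule_count.get(key, 0) + 1
    rules = []
    for (ante, label), count in rule_count.items():
        support = count / n
        confidence = count / ante_count[ante]
        if support >= min_sup and confidence >= min_conf:
            rules.append((frozenset(ante), label, support, confidence))
    return rules


def classify(tokens, rules, default=None):
    """Predict the label of the best rule whose antecedent the document contains."""
    present = set(tokens)
    matching = [r for r in rules if r[0] <= present]
    if not matching:
        return default
    # Prefer the highest confidence, then the highest support.
    return max(matching, key=lambda r: (r[3], r[2]))[1]


train_docs = [["good", "film"], ["good", "plot"], ["bad", "film"], ["bad", "plot"]]
train_labels = ["pos", "pos", "neg", "neg"]
rules = mine_class_rules(train_docs, train_labels, min_sup=0.5, min_conf=0.8)
prediction = classify(["good", "story"], rules)
```

In this toy training set only the rules for "good" and "bad" survive both thresholds; raising or lowering `min_sup` and `min_conf` changes which rules the classifier keeps, which is the tuning the abstract refers to.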
In this paper we introduce techniques for classifying Arabic documents based on association rules built from maximal frequent itemsets. The Parallel Maximal Itemset Miner Algorithm (PMIMA), which applies several conditions to prune the search space in parallel, is introduced for extracting maximal frequent itemsets. Rule length, rule weight, and rule majority are the three classification methods used to classify Arabic documents. Compared with the classification results obtained from all frequent itemsets extracted by Apriori, the results demonstrate the efficiency of our approach.
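The abstract names three classification methods but does not define them. The sketch below shows one plausible reading of each, purely as an assumption: "majority" as a vote over matching rules, "length" as preferring the rule with the longest antecedent, and "weight" as summing a per-rule weight for each class. The rule tuple layout and all function names are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def matching_rules(tokens, rules):
    """Keep the rules whose antecedent is contained in the document."""
    present = set(tokens)
    return [r for r in rules if r[0] <= present]


def by_majority(matching):
    """Assumed reading: the class predicted by the most matching rules."""
    return Counter(label for _, label, _ in matching).most_common(1)[0][0]


def by_length(matching):
    """Assumed reading: the class of the rule with the longest antecedent."""
    return max(matching, key=lambda r: len(r[0]))[1]


def by_weight(matching):
    """Assumed reading: the class with the largest total rule weight."""
    totals = defaultdict(float)
    for _, label, weight in matching:
        totals[label] += weight
    return max(totals, key=totals.get)


# Hypothetical rules: (antecedent, label, weight).
rules = [
    (frozenset({"match"}), "sport", 0.6),
    (frozenset({"team"}), "sport", 0.5),
    (frozenset({"match", "election", "vote"}), "politics", 0.9),
]
doc = ["match", "team", "election", "vote"]
hits = matching_rules(doc, rules)
```

On this toy document the three readings need not agree: the majority and weight variants favour "sport", while the length variant favours "politics" because its single matching rule has the longest antecedent.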