We are given a large database of customer transactions. Each transaction consists of items purchased by a customer in a visit. We present an efficient algorithm that generates all significant association rules between items in the database. The algorithm incorporates buffer management and novel estimation and pruning techniques. We also present results of applying this algorithm to sales data obtained from a large retailing company, which shows the effectiveness of the algorithm.
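To make the notion of an association rule concrete, the following is a minimal brute-force sketch. It is not the algorithm of the abstract above, and it omits buffer management, estimation, and pruning. It assumes that "significant" means meeting a minimum support and a minimum confidence threshold and that each rule has a single-item consequent. The names `support`, `find_rules`, `min_support`, `min_confidence`, and `max_size`, and the toy transactions, are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)


def find_rules(transactions, min_support, min_confidence, max_size=3):
    """Enumerate rules (antecedent -> single item) by exhaustive search.

    Assumption: a rule is kept when the support of the whole itemset and
    the confidence of the rule both reach the given thresholds.
    """
    items = sorted(set().union(*transactions))
    found = []
    for size in range(2, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            s = support(itemset, transactions)
            if s < min_support:
                continue
            for consequent in combo:
                antecedent = itemset - {consequent}
                confidence = s / support(antecedent, transactions)
                if confidence >= min_confidence:
                    found.append(
                        (tuple(sorted(antecedent)), consequent, s, confidence)
                    )
    return found


# Toy data, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
found = find_rules(transactions, min_support=0.5, min_confidence=0.9)
```

On this toy data the only rule that passes both thresholds is butter -> bread. Exhaustive enumeration grows exponentially with the number of items, which is why a practical algorithm needs the kind of pruning the abstract mentions.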
We propose an indexing method for time sequences for processing similarity queries. We use the Discrete Fourier Transform (DFT) to map time sequences to the frequency domain, the crucial observation being that, for most sequences of practical interest, only the first few frequencies are strong. Another important observation is Parseval's theorem, which specifies that the Fourier transform preserves the Euclidean distance in the time or frequency domain. Having thus mapped sequences to a lower-dimensionality space by using only the first few Fourier coefficients, we use R*-trees to index the sequences and efficiently answer similarity queries. We provide experimental results which show that our method is superior to search based on sequential scanning. Our experiments show that a few coefficients (1-3) are adequate to provide good performance. The performance gain of our method increases with the number and length of sequences.
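The two observations in the abstract above can be checked numerically with a small sketch. It assumes an orthonormal DFT (scaled by 1/sqrt(n)) so that Parseval's theorem holds without extra constants. The function names and the two sample sequences are hypothetical, and the spatial index itself is not shown.

```python
import cmath
import math


def dft(x):
    """Orthonormal DFT: the 1/sqrt(n) scaling makes distances carry over exactly."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * f * t / n) for t in range(n))
        / math.sqrt(n)
        for f in range(n)
    ]


def euclidean(a, b):
    """Euclidean distance; works for real or complex coordinates."""
    return math.sqrt(sum(abs(p - q) ** 2 for p, q in zip(a, b)))


def features(x, k):
    """Keep only the first k Fourier coefficients as the index key."""
    return dft(x)[:k]


# Two invented sequences of equal length.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 4.0, 3.0, 2.0]
y = [1.5, 2.5, 2.5, 4.5, 4.0, 4.5, 2.0, 2.5]

d_time = euclidean(x, y)                            # distance in the time domain
d_freq = euclidean(dft(x), dft(y))                  # same distance, frequency domain
d_feat = euclidean(features(x, 2), features(y, 2))  # distance on 2 coefficients
```

Here `d_time` and `d_freq` agree up to rounding, and `d_feat` never exceeds `d_time` because dropping coefficients only removes non-negative terms from the sum. A range query run on the truncated features can therefore return extra candidates, which are filtered afterwards against the full sequences, but it cannot miss a true match.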
We present our perspective of database mining as the confluence of machine learning techniques and the performance emphasis of database technology. We describe three classes of database mining problems involving classification, associations, and sequences, and argue that these problems can be uniformly viewed as requiring discovery of rules embedded in massive data. We describe a model and some basic operations for the process of rule discovery. We show how the database mining problems we consider map to this model and how they can be solved by using the basic operations we propose. We give an example of an algorithm for classification obtained by combining the basic rule discovery operations. This algorithm not only is efficient in discovering classification rules but also has accuracy comparable to ID3, one of the current best classifiers.
We describe set-oriented algorithms for mining association rules. Such algorithms imply performing multiple joins and may appear to be inherently less efficient than special-purpose algorithms. We develop new algorithms that can be expressed as SQL queries, and discuss optimization of these algorithms. After analytical evaluation, an algorithm named SETM emerges as the algorithm of choice. Algorithm SETM uses only simple database primitives, viz., sorting and merge-scan join. Algorithm SETM is simple, fast, and stable over the range of parameter values. The major contribution of this paper is that it shows that at least some aspects of data mining can be carried out by using general query languages such as SQL, rather than by developing specialized black box algorithms. The set-oriented nature of Algorithm SETM facilitates the development of extensions.
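The claim that pattern counting can be phrased in plain SQL can be illustrated with a short sketch. It is not Algorithm SETM itself: it only shows one set-oriented step, counting item pairs through a self-join, and it leaves the join strategy to the database engine. The table `sales(tid, item)`, the threshold `MIN_COUNT`, and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database with one row per (transaction id, item).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (tid INTEGER, item TEXT)")
rows = [
    (1, "bread"), (1, "milk"),
    (2, "bread"), (2, "butter"),
    (3, "bread"), (3, "milk"), (3, "butter"),
    (4, "milk"),
]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

MIN_COUNT = 2  # assumed minimum number of supporting transactions

# Self-join on the transaction id; a.item < b.item lists each pair once.
frequent_pairs = conn.execute(
    """
    SELECT a.item, b.item, COUNT(*) AS cnt
    FROM sales a
    JOIN sales b ON a.tid = b.tid AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING COUNT(*) >= ?
    ORDER BY a.item, b.item
    """,
    (MIN_COUNT,),
).fetchall()
```

On the sample rows the query keeps the pairs (bread, butter) and (bread, milk), each with a count of 2, and drops (butter, milk). Longer patterns could be built the same way by joining the surviving pairs back against `sales`, which is the sense in which such an approach is set-oriented.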