Two important and active areas of current research are data mining and the World Wide Web. A natural combination of the two areas, sometimes referred to as Web mining, has been the focus of several recent research projects and papers. As with any emerging research area, there is no established vocabulary, leading to confusion when comparing research efforts. Different terms for the same concept, or different definitions attached to the same word, are commonplace. The term Web mining has been used in two distinct ways. The first, which is referred to as Web content mining in this paper, describes the process of information or resource discovery from millions of sources across the World Wide Web. The second, which we call Web usage mining, is the process of mining Web access logs or other user information for user browsing and access patterns on one or more Web localities. In this paper we define Web mining and, in particular, present an overview of the various research issues, techniques, and development efforts in Web content mining and Web usage mining. We focus mainly on the problems and proposed techniques associated with Web usage mining as an emerging research area. We also present a general architecture for Web usage mining and briefly describe the WEBMINER, a system based on the proposed architecture. We conclude this paper by listing issues that need the attention of the research community.
Abstract. Objective measures such as support, confidence, interest factor, correlation, and entropy are often used to evaluate the interestingness of association patterns. However, in many situations, these measures may provide conflicting information about the interestingness of a pattern. Data mining practitioners also tend to apply an objective measure without realizing that there may be better alternatives available for their application. In this paper, we describe several key properties one should examine in order to select the right measure for a given application. A comparative study of these properties is made using twenty-one measures that were originally developed in diverse fields such as statistics, social science, machine learning, and data mining. We show that depending on its properties, each measure is useful for some application, but not for others. We also demonstrate two scenarios in which many existing measures become consistent with each other, namely, when support-based pruning and a technique known as table standardization are applied. Finally, we present an algorithm for selecting a small set of patterns such that domain experts can find a measure that best fits their requirements by ranking this small set of patterns.
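To make the point about conflicting measures concrete, the following is a minimal sketch that computes four of the measures named above (support, confidence, interest factor, and the phi correlation coefficient) for a rule A -> B from a 2x2 contingency table. The function name `association_measures`, the argument names, and the example counts are assumptions chosen for illustration; they are not taken from the paper, and the sketch does not reproduce the paper's comparative study or its selection algorithm.

```python
from math import sqrt


def association_measures(f11, f10, f01, f00):
    """Compute common objective measures for the rule A -> B.

    Counts come from a 2x2 contingency table (hypothetical naming):
      f11: transactions containing both A and B
      f10: transactions containing A but not B
      f01: transactions containing B but not A
      f00: transactions containing neither
    Assumes A and B each occur in some, but not all, transactions,
    so that no denominator below is zero.
    """
    n = f11 + f10 + f01 + f00
    p_ab = f11 / n           # P(A, B)
    p_a = (f11 + f10) / n    # P(A)
    p_b = (f11 + f01) / n    # P(B)

    return {
        # Fraction of transactions that contain both A and B.
        "support": p_ab,
        # Estimate of P(B | A).
        "confidence": p_ab / p_a,
        # Interest factor (lift): ratio of the observed joint probability
        # to the one expected under independence; 1 means independence.
        "interest": p_ab / (p_a * p_b),
        # Phi coefficient: Pearson correlation for two binary variables.
        "phi": (p_ab - p_a * p_b) / sqrt(p_a * (1 - p_a) * p_b * (1 - p_b)),
    }


# Hypothetical counts: A is frequent with B, but B is even more frequent overall.
measures = association_measures(f11=150, f10=50, f01=650, f00=150)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

With these illustrative counts the confidence is 0.75, which looks strong on its own, yet the interest factor is below 1 (about 0.94) and the phi coefficient is slightly negative (about -0.06), because B occurs in 80% of all transactions. This is one simple way in which two measures can rank the same pattern differently.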