On the Existence and Significance of Data Preprocessing Biases in Web-Usage Mining

Zheng, Zhiqiang; Padmanabhan, Balaji; Kimbrough, Steven O.

doi:10.1287/ijoc.15.2.148.14449

Cited by 20 publications

(12 citation statements)

References 22 publications


“…Other data reduction and pre-processing techniques are discussed in Han and Kamber (2006). Zheng et al (2003) study some popular data-reduction methods and show that they are not without their pitfalls since different methods can lead to drastically different results, both in terms of characterizing the original data and out-of-sample predictions. In the analysis that follows, we show that our method of aggregation has no such problems (at least within the broad scope of our analysis).…”

Section: Related Literature (mentioning)

confidence: 99%

Customer-base analysis using repeated cross-sectional summary (RCSS) data

Jerath

Fader

Hardie

2016

European Journal of Operational Research


We address a critical question that many firms are facing today: Can customer data be stored and analyzed in an easy-to-manage and scalable manner without significantly compromising the inferences that can be made about the customers' transaction activity? We address this question in the context of customer-base analysis. A number of researchers have developed customer-base analysis models that perform very well given detailed individual-level data. We explore the possibility of estimating these models using aggregated data summaries alone, namely repeated cross-sectional summaries (RCSS) of the transaction data. Such summaries are easy to create, visualize, and distribute, irrespective of the size of the customer base. An added advantage of the RCSS data structure is that individual customers cannot be identified, which makes it desirable from a data privacy and security viewpoint as well. We focus on the widely used Pareto/NBD model and carry out a comprehensive simulation study covering a vast spectrum of market scenarios. We find that the RCSS format of four quarterly histograms serves as a suitable substitute for individual-level data. We confirm the results of the simulations on a real dataset of purchasing from an online fashion retailer.

Keywords: customer-base analysis, probability models, data aggregation, data privacy and security, information loss

Customer-Base Analysis on a "Data Diet": Model Inference Using Repeated Cross-Sectional Summary (RCSS) Data

Abstract: We address a critical question that many firms are facing in this era of "big data": Can customer data be stored and analyzed in an easy-to-manage and scalable manner without significantly compromising the inferences that can be made about the customers' transaction activity? We address this question in the context of customer-base analysis. A number of researchers have developed customer-base analysis models that perform very well given detailed individual-level data. We explore the possibility of estimating these models using aggregated data summaries alone, namely repeated cross-sectional summaries (RCSS) of the transaction data (e.g., four quarterly histograms). Such summaries are easy to create, visualize, and distribute, irrespective of the size of the customer base. An added advantage of RCSS data is that individual customers cannot be identified, which makes it desirable from a privacy viewpoint as well. We focus on the widely used Pareto/NBD model and carry out a comprehensive simulation study covering a vast spectrum of market scenarios. Our results consistently and convincingly establish that model performance associated with the use of three or four cross-sections of RCSS data (as judged by model fit, parameter recovery, and forward-looking metrics of customer value) can closely match the model performance associated with the use of individual-level data. We confirm the results of the simulations on a real dataset of purchases from an online fashion retailer. The thesis of our approach is that existing statistical models con...


“…Padmanabhan, Zheng, and Kimbrough have studied the impact of data preparation alternatives upon Web-usage mining in Padmanabhan et al (2001) and in Zheng et al (2003). In Padmanabhan et al (2001), they focus on the prediction of purchase for users visiting multiple sites.…”

Section: Spiliopoulou Mobasher Berendt and Nakagawa A Framework Fo (mentioning)

confidence: 99%

“…The authors show that when the analysis is based on the activities inside one site only, the accuracy of the predictors drops significantly. Zheng et al (2003) compare a set of methods for purchase prediction, each of which exploits different components of the users' sessions on which to make predictions. They compute the prediction accuracy of these methods using several classifiers.…”

Section: Spiliopoulou Mobasher Berendt and Nakagawa A Framework Fo (mentioning)

confidence: 99%

“…In our study, we compare data preparation strategies whose goal is the reconstruction of the sessions, upon which preparation methods for prediction, as in Padmanabhan et al (2001) and Zheng et al (2003) are applied. Differently from these authors, we do not evaluate the performance of these strategies for a specific KDD application.…”

Section: Spiliopoulou Mobasher Berendt and Nakagawa A Framework Fo (mentioning)

confidence: 99%


A Framework for the Evaluation of Session Reconstruction Heuristics in Web-Usage Analysis

Spiliopoulou

Mobasher

Berendt

et al.

2003

INFORMS Journal on Computing


Web-usage mining has become the subject of intensive research, as its potential for personalized services, adaptive Web sites and customer profiling is recognized. However, the reliability of Web-usage mining results depends heavily on the proper preparation of the input datasets. In particular, errors in the reconstruction of sessions and incomplete tracing of users' activities in a site can easily result in invalid patterns and wrong conclusions. In this study, we evaluate the performance of heuristics employed to reconstruct sessions from the server log data. Such heuristics are called to partition activities first by user and then by visit of the user in the site, where user identification mechanisms, such as cookies, may or may not be available. We propose a set of performance measures that are sensitive to two types of reconstruction errors and appropriate for different applications in knowledge discovery (KDD) applications. We have tested our framework on the Web server data of a frame-based Web site. The first experiment concerned a specific KDD application and has shown the sensitivity of the heuristics to particularities of the site's structure and traffic. The second experiment is not bound to a specific application but rather compares the performance of the heuristics for different measures and thus for different application types. Our results show that there is no single best heuristic, but our measures help the analyst in the selection of the heuristic best suited for the application at hand.


“…The second eCRM problem, primarily in the DM domain, pertains to preprocessing click-stream data as the basis for building DM models, such as purchase prediction models. Zheng et al (2003) show that inappropriate preprocessing of data can result in significantly worse DM models for critical eCRM problems. Given the nature of click-stream data and the fact that hundreds of derived variables can be created from this data for a user session, it is important to partition…”

Section: Introduction (mentioning)

confidence: 99%

On the Use of Optimization for Data Mining: Theoretical Interactions and eCRM Opportunities

Padmanabhan

Tuzhilin

2003

Management Science

Self Cite


Previous work on the solution to analytical electronic customer relationship management (eCRM) problems has used either data-mining (DM) or optimization methods, but has not combined the two approaches. By leveraging the strengths of both approaches, the eCRM problems of customer analysis, customer interactions, and the optimization of performance metrics (such as the lifetime value of a customer on the Web) can be better analyzed. In particular, many eCRM problems have been traditionally addressed using DM methods. There are opportunities for optimization to improve these methods, and this paper describes these opportunities. Further, an online appendix (mansci.pubs.informs.org/ecompanion.html) describes how DM methods can help optimization-based approaches. More generally, this paper argues that the reformulation of eCRM problems within this new framework of analysis can result in more powerful analytical approaches.
