Behavioral targeting (BT) leverages historical user behavior to select the most relevant ads to display to each user. The state of the art in BT derives a linear Poisson regression model from fine-grained user behavioral data and predicts click-through rate (CTR) from user history. We designed and implemented a highly scalable and efficient solution to BT using the Hadoop MapReduce framework. With our parallel algorithm and the resulting system, we can build more than 450 BT-category models from Yahoo's entire user base within one day, a scale unattainable with prior systems. Moreover, our approach has yielded a 20% CTR lift over the existing production system by leveraging a well-grounded probabilistic model fitted to a much larger training dataset. Specifically, our major contributions include: (1) a MapReduce statistical learning algorithm and implementation that achieve optimal data parallelism, task parallelism, and load balance despite the typically skewed distribution of domain data; (2) an in-place feature vector generation algorithm with linear time complexity O(n), regardless of the granularity of the sliding target window; (3) an in-memory caching scheme that significantly reduces the number of disk I/Os, making large-scale learning practical; and (4) highly efficient data structures and sparse representations of models and data that enable fast model updates. We believe our work makes a significant contribution to solving large-scale machine learning problems of industrial relevance in general. Finally, we report comprehensive experimental results obtained with an industrial proprietary codebase and datasets.
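The abstract's linear Poisson regression for click counts can be sketched as follows. This is a minimal illustration, not the paper's production implementation: it assumes an identity link (expected clicks are a non-negative linear function of behavioral count features) and fits the weights with the standard multiplicative update for Poisson likelihood; all function and variable names are our own.

```python
import numpy as np

def fit_poisson(X, y, iters=200):
    """Fit y ~ Poisson(X @ w) with non-negative weights w.

    X holds non-negative behavioral count features (rows = users),
    y holds observed click counts. Uses the classic multiplicative
    update, which preserves non-negativity of the weights.
    """
    n, d = X.shape
    w = np.full(d, 0.1)                     # small positive initialization
    col_sums = X.sum(axis=0) + 1e-12        # denominator of the update
    for _ in range(iters):
        lam = X @ w + 1e-12                 # expected clicks per user
        w *= (X.T @ (y / lam)) / col_sums   # multiplicative update step
    return w

def predict_clicks(w, x):
    """Expected click count for one user's feature vector."""
    return float(w @ x)
```

When the data are exactly linear (y = X w* for some non-negative w*), that w* is a fixed point of the update, so the fit recovers it.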
We investigate the problem of generating fast approximate answers to queries posed to large sparse binary data sets. We focus in particular on probabilistic model-based approaches to this problem and develop a number of techniques that are significantly more accurate than a baseline independence model. In particular, we introduce two techniques for building probabilistic models from frequent itemsets: the itemset maximum entropy method and the itemset inclusion-exclusion model. In the maximum entropy method, we treat itemsets as constraints on the distribution of the query variables and use the maximum entropy principle to build a joint probability model for the query attributes online. In the inclusion-exclusion model, itemsets and their frequencies are stored in a data structure called an ADtree that supports an efficient implementation of the inclusion-exclusion principle for answering the query. We empirically compare these two itemset-based models to direct querying of the original data, querying of samples of the original data, and other probabilistic models such as the independence model, the Chow-Liu tree model, and the Bernoulli mixture model. These models can handle high dimensionality (hundreds or thousands of attributes), whereas most other work on this topic has focused on relatively low-dimensional OLAP problems. Experimental results on both simulated and real-world transaction data sets illustrate various fundamental tradeoffs between approximation error, model complexity, and the online time required to compute a query answer.

Index Terms: Binary transaction data, query approximation, probabilistic model, itemsets, ADtree, maximum entropy.

I. INTRODUCTION

Massive data sets containing huge numbers of records are of increasing interest to organizations that routinely collect such data and to data miners who try to find regularities in them.
One class of such data is transaction data, with rows corresponding to transactions and columns corresponding to particular items or attributes. This class is typically characterized by sparseness: there may be hundreds or thousands of binary attributes, but a particular record may have only a few of them set to 1. An example of a binary transaction data set is a Web log that records page requests for a particular Web site. The rows (records) in such a data set correspond to various users accessing the site, and the columns (attributes) correspond to different pages within the site. Clearly, most users access only a small fraction of the pages, making the data set sparse. Within a single day, popular Web sites can produce millions of records. Query selectivity estimation for such binary data can be defined as follows. Let R = {A_1, . . . , A_k} be a table header with k 0/1-valued attributes (variables) and r be a table of n rows over header R. We assume that k ≪ n, and that the data are sparse, i.e., the average number of 1's per row is substantially smaller than the number of attributes. By definition, a row of the table r satisfies a c...
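The baseline independence model mentioned above can be illustrated concretely for conjunctive queries of the form P(A_i = 1, . . . , A_j = 1): it approximates the joint selectivity by the product of the attributes' marginal frequencies. This is a minimal sketch under that assumption, with our own function names; rows are transactions and 0/1 columns are attributes, as in the definition above.

```python
import numpy as np

def independence_estimate(data, query_cols):
    """Approximate the selectivity of a conjunctive query
    (all queried attributes equal to 1) under the independence
    model: the product of the queried attributes' marginals."""
    marginals = data.mean(axis=0)          # P(A_i = 1) for each column
    return float(np.prod(marginals[query_cols]))

def true_selectivity(data, query_cols):
    """Exact fraction of rows with every queried attribute set to 1,
    i.e., the answer obtained by querying the original data directly."""
    return float(data[:, query_cols].all(axis=1).mean())
```

The gap between the two functions on correlated attributes is exactly the approximation error that the itemset-based models in the abstract are designed to reduce.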
We present a mixture model based approach for learning
Support vector machines (SVMs) provide classification models with strong theoretical foundations as well as excellent empirical performance on a variety of applications. One of the major drawbacks of SVMs is the necessity of solving a large-scale quadratic programming problem. This paper combines likelihood-based squashing with a probabilistic formulation of SVMs, enabling fast training on squashed data sets. We reduce the problem of training the SVMs on the weighted "squashed" data to a quadratic programming problem and show that it can be solved using Platt's sequential minimal optimization (SMO) algorithm. We compare the performance of the SMO algorithm on the squashed and the full data, as well as on simple random and boosted samples of the data. Experiments on a number of datasets show that squashing allows one to speed up training, decrease memory requirements, and obtain parameter estimates close to those from the full data. More importantly, squashing produces close-to-optimal classification accuracies.
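The core idea of squashing, replacing many similar rows with a few weighted pseudo-points that a weight-aware learner can train on, can be sketched as follows. This is an illustrative simplification, not the paper's likelihood-based procedure: here near-duplicates are grouped by coordinate rounding, and all names are our own.

```python
import numpy as np

def squash(X, y, decimals=0):
    """Collapse near-duplicate rows (same label, same rounded features)
    into weighted pseudo-points. Returns pseudo-points Xs, their labels
    ys, and integer weights w with w.sum() == len(X), so a learner that
    accepts instance weights sees an equivalent but much smaller set."""
    keys = np.round(X, decimals)
    counts = {}                              # (features, label) -> count
    for k, label in zip(map(tuple, keys), y):
        counts[(k, int(label))] = counts.get((k, int(label)), 0) + 1
    Xs = np.array([list(k) for k, _ in counts])
    ys = np.array([label for _, label in counts])
    w = np.array(list(counts.values()))
    return Xs, ys, w
```

In the paper's setting, the weighted pseudo-points feed into the quadratic program in place of the raw rows, which is what makes SMO training on the squashed set fast.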