Set-valued data, in which a set of values are associated with an individual, is common in databases ranging from market basket data, to medical databases of patients' symptoms and behaviors, to query engine search logs. Anonymizing this data is important if we are to reconcile the conflicting demands arising from the desire to release the data for study and the desire to protect the privacy of individuals represented in the data. Unfortunately, the bulk of existing anonymization techniques, which were developed for scenarios in which each individual is associated with only one sensitive value, are not well-suited for set-valued data. In this paper we propose a top-down, partition-based approach to anonymizing set-valued data that scales linearly with the input size and scores well on an information-loss data quality metric. We further note that our technique can be applied to anonymize the infamous AOL query logs, and discuss the merits and challenges in anonymizing query logs using our approach.
We investigate algorithms for evaluating moving window joins over pairs of unbounded streams. We introduce a unit-time-basis cost model to analyze the expected performance of these algorithms. Using this cost model, we propose strategies for maximizing the efficiency of processing joins in three scenarios. First, we consider the case where one stream is much faster than the other. We show that an asymmetric combination of hash join and nested loops join (hash join on one input, nested loops join on the other) can outperform both the symmetric hash join and the symmetric nested loops join. Second, we investigate the case where system resources are insufficient to keep up with the input streams. We show that we can maximize the number of join result tuples produced in this case by properly allocating computing resources across the two inputs streams. Finally, we investigate strategies for maximizing the number of result tuples produced when memory is limited. We revisit the first two cases in this context, and show that proper memory allocation across the two input streams can result in significantly lower resource usage and/or more result tuples produced.
While significant progress has been made separately on analytics systems for scalable stochastic gradient descent (SGD) and private SGD, none of the major scalable analytics frameworks have incorporated differentially private SGD. There are two inter-related issues for this disconnect between research and practice: (1) low model accuracy due to added noise to guarantee privacy, and (2) high development and runtime overhead of the private algorithms. This paper takes a first step to remedy this disconnect and proposes a private SGD algorithm to address both issues in an integrated manner. In contrast to the whitebox approach adopted by previous work, we revisit and use the classical technique of output perturbation to devise a novel "bolt-on" approach to private SGD. While our approach trivially addresses (2), it makes (1) even more challenging. We address this challenge by providing a novel analysis of the L 2 -sensitivity of SGD, which allows, under the same privacy guarantees, better convergence of SGD when only a constant number of passes can be made over the data. We integrate our algorithm, as well as other state-of-the-art differentially private SGD, into Bismarck, a popular scalable SGD-based analytics system on top of an RDBMS. Extensive experiments show that our algorithm can be easily integrated, incurs virtually no overhead, scales well, and most importantly, yields substantially better (up to 4X) test accuracy than the state-of-the-art algorithms on many real datasets.
Recently there has been a growing interest in join query evaluation for scenarios in which inputs arrive at highly variable and unpredictable rates. In such scenarios, the focus shifts from completing the computation as soon as possible to producing a prefix of the output as soon as possible. To handle this shift in focus, most solutions to date rely upon some combination of streaming binary operators and "on-the-fly" execution plan reorganization. In contrast, we consider the alternative of extending existing symmetric binary join operators to handle more than two inputs. Toward this end, we have completed a prototype implementation of a multi-way join operator, which we term the "MJoin" operator, and explored its performance. Our results show that in many instances the MJoin produces outputs sooner than any tree of binary operators. Additionally, since MJoins are completely symmetric with respect to their inputs, they can reduce the need for expensive runtime plan reorganization. This suggests that supporting multi-way joins in a single, symmetric, streaming operator may be a useful addition to systems that support queries over input streams from remote sites.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.
customersupport@researchsolutions.com
10624 S. Eastern Ave., Ste. A-614
Henderson, NV 89052, USA
This site is protected by reCAPTCHA and the Google Privacy Policy and Terms of Service apply.
Copyright © 2024 scite LLC. All rights reserved.
Made with 💙 for researchers
Part of the Research Solutions Family.