“…where $\mathcal{F}$ is the class of functions $f$ in the unit ball of the reproducing kernel Hilbert space associated with a kernel function $k$. We defer the discussion on kernels appropriate for use with MMD to (Tay et al 2021). […] [a biased estimate] $\mathrm{MMD}_b^2(\mathcal{F}, S, T)$ of the squared MMD can be obtained in the form of matrix Frobenius inner products, as shown in (Gretton et al 2012):…”
Section: Data Valuation With Maximum Mean Discrepancy (MMD)
Citation type: mentioning
confidence: 99%
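The estimator the statement above refers to can be sketched as follows. This is a minimal illustration of the standard biased (V-statistic) squared-MMD estimate from Gretton et al 2012, not the authors' code. The squared-exponential kernel, its `lengthscale`, and the names `rbf_kernel`, `mean_kernel`, and `mmd2_biased` are assumptions made for the example; the source defers the kernel choice to (Tay et al 2021).

```python
import math


def rbf_kernel(x, y, lengthscale=1.0):
    """Squared-exponential kernel (an assumed choice) between two points."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * lengthscale ** 2))


def mean_kernel(A, B, kernel):
    """Average of k(a, b) over all pairs (a, b).

    Equivalent to the Frobenius inner product of the kernel matrix K_AB
    with a constant matrix whose entries are 1 / (|A| * |B|).
    """
    return sum(kernel(a, b) for a in A for b in B) / (len(A) * len(B))


def mmd2_biased(S, T, kernel=rbf_kernel):
    """Biased estimate of the squared MMD between samples S and T."""
    return (
        mean_kernel(S, S, kernel)
        - 2.0 * mean_kernel(S, T, kernel)
        + mean_kernel(T, T, kernel)
    )


# Toy 1-D samples (hypothetical data for illustration only).
S = [(0.0,), (0.1,), (-0.1,)]
T_near = [(0.05,), (-0.05,), (0.0,)]
T_far = [(3.0,), (3.1,), (2.9,)]

print(mmd2_biased(S, T_near))
print(mmd2_biased(S, T_far))
```

A sample set close to `S` should give a smaller estimate than a distant one, which is the property an MMD-based valuation relies on when it scores data by closeness to a reference distribution.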
“…which is a reasonable choice for our problem setting under the following practical assumptions: (A) Every party benefits from having data drawn from $\mathcal{D}$ besides having just its dataset $D_i$ since $D_i$ may only be sampled from a restricted subset of the support of $\mathcal{D}$. We discuss its validity in (Tay et al 2021).…”
Section: Data Valuation With Maximum Mean Discrepancy (MMD)
Citation type: mentioning
confidence: 99%
“…Given that $v_c$ is non-negative and monotonically increasing (a later section will show sufficient conditions that guarantee these properties), the reward scheme of Sim et al (2020) exploits the notion of ρ-Shapley fair reward values $r_i := (\phi_i / \phi^*)^{\rho} \times v_c(N)$ for each party $i \in N$ with an adjustable parameter $\rho$ to trade off between satisfying the incentives. For your convenience, we reproduce their main result and full definitions in (Tay et al 2021).…”
Section: Reward Scheme For Guaranteeing Incentives In CGM Framework
Citation type: mentioning
confidence: 99%
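The ρ-Shapley fair reward formula quoted above can be illustrated with a small sketch. It assumes that $\phi_i$ is party $i$'s Shapley value under $v_c$ and that $\phi^*$ is the largest Shapley value among the parties; the excerpt does not spell out either definition. The characteristic function `v` below (square root of pooled data size) and the party sizes are hypothetical stand-ins chosen only because they are non-negative and monotone; they are not the paper's MMD-based valuation.

```python
import math
from itertools import permutations


def shapley_values(parties, v):
    """Exact Shapley values by averaging marginal contributions over all orderings."""
    totals = {i: 0.0 for i in parties}
    orderings = list(permutations(parties))
    for order in orderings:
        coalition = frozenset()
        for i in order:
            with_i = coalition | {i}
            totals[i] += v(with_i) - v(coalition)
            coalition = with_i
    return {i: t / len(orderings) for i, t in totals.items()}


def rho_shapley_rewards(parties, v, rho):
    """r_i = (phi_i / phi_star) ** rho * v(N), with phi_star assumed to be max_i phi_i."""
    phi = shapley_values(parties, v)
    phi_star = max(phi.values())
    v_grand = v(frozenset(parties))
    return {i: (phi[i] / phi_star) ** rho * v_grand for i in parties}


# Hypothetical parties and valuation, for illustration only.
sizes = {"A": 100, "B": 400, "C": 900}


def v(coalition):
    return math.sqrt(sum(sizes[i] for i in coalition))


for rho in (1.0, 0.5, 0.0):
    print(rho, rho_shapley_rewards(list(sizes), v, rho))
```

Under these assumptions, ρ = 1 gives rewards proportional to Shapley values, while smaller ρ raises the rewards of parties with smaller contributions toward $v_c(N)$, which is the trade-off the quoted statement describes. The exhaustive enumeration is only practical for a handful of parties.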
“…This formulation also informs us of a suitable choice of the synthetic dataset $G$: A sufficient but not necessary condition for the feasible set of the LP to be non-empty is $\min_{i \in N} v_i^{\max} \geq \max_{i \in N} v_i^{\min}$. When generating the synthetic dataset $G$, we may thus increase the size of $G$ until this condition is satisfied; we provide an intuition for why this works in (Tay et al 2021).…”
Section: A Modified Reward Scheme With Rectified ρ-Shapley Fair Rewar...
Citation type: mentioning
confidence: 99%
“…In each iteration of our weighted sampling algorithm for distributing synthetic data reward to party $i$ (Algo. 1 in (Tay et al 2021)), we firstly perform min-max normalization to rescale $\Delta_x$ to $\bar{\Delta}_x$ for all synthetic data points $x \in G \setminus G_i$ to lie within the $[0, 1]$ interval. We compute the probability of each synthetic data point $x$ being sampled using the softmax function: $p(x) = \exp(\beta \bar{\Delta}_x) / \sum_{x' \in G \setminus G_i} \exp(\beta \bar{\Delta}_{x'})$ where $\beta \in [0, \infty)$ is the inverse temperature hyperparameter.…”
Section: Distributing Synthetic Data Rewards To Parties Via Weighted ...
Citation type: mentioning
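The normalization and softmax step in the statement above can be sketched as below. This covers only the sampling probabilities for one iteration, not the full Algo. 1, which is not reproduced in the excerpt. The per-point scores are taken as given inputs (how they are computed is not shown here), and the function names and toy numbers are assumptions for illustration.

```python
import math
import random


def min_max_normalize(deltas):
    """Rescale scores to [0, 1]; if all scores are equal, return zeros."""
    lo, hi = min(deltas), max(deltas)
    if hi == lo:
        return [0.0 for _ in deltas]
    return [(d - lo) / (hi - lo) for d in deltas]


def softmax_probs(deltas, beta):
    """p(x) proportional to exp(beta * normalized score), with beta >= 0."""
    scaled = min_max_normalize(deltas)
    top = max(scaled)
    # Subtracting the maximum leaves the probabilities unchanged
    # and avoids overflow for large beta.
    weights = [math.exp(beta * (s - top)) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


def sample_index(deltas, beta, rng=random):
    """Draw the index of one candidate synthetic point according to p(x)."""
    probs = softmax_probs(deltas, beta)
    return rng.choices(range(len(deltas)), weights=probs, k=1)[0]


# Hypothetical scores for four candidate synthetic points.
deltas = [0.2, 0.5, 0.1, 0.9]
print(softmax_probs(deltas, beta=0.0))   # uniform over the candidates
print(softmax_probs(deltas, beta=5.0))   # mass shifts toward the highest score
print(sample_index(deltas, beta=5.0, rng=random.Random(0)))
```

With β = 0 every candidate is equally likely, and as β grows the distribution concentrates on the highest-scoring point, so β controls how greedy the sampling is.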
Abstract: This paper presents a novel collaborative generative modeling (CGM) framework that incentivizes collaboration among self-interested parties to contribute data to a pool for training a generative model (e.g., GAN), from which synthetic data are drawn and distributed to the parties as rewards commensurate to their contributions. Distributing synthetic data as rewards (instead of trained models or money) offers task- and model-agnostic benefits for downstream learning tasks and is less likely to violate data privacy regulation. To realize the framework, we firstly propose a data valuation function using maximum mean discrepancy (MMD) that values data based on its quantity and quality in terms of its closeness to the true data distribution and provide theoretical results guiding the kernel choice in our MMD-based data valuation function. Then, we formulate the reward scheme as a linear optimization problem that, when solved, guarantees certain incentives such as fairness in the CGM framework. We devise a weighted sampling algorithm for generating synthetic data to be distributed to each party as reward such that the value of its data and the synthetic data combined matches its assigned reward value by the reward scheme. We empirically show using simulated and real-world datasets that the parties' synthetic data rewards are commensurate to their contributions.