Biased Stochastic Gradient Descent for Conditional Stochastic Optimization

Hu, Yifan; Zhang, Siqi; Chen, Xin; He, Niao

doi:10.48550/arxiv.2002.10790

Cited by 7 publications

(12 citation statements)

References 18 publications


“…Since this is a standard results, similar results are showed in Bernstein et al (2018); Devolder et al (2014); Hu et al (2020); Ajalloeian and Stich (2020). For the sake of completeness, we provide the…”

Section: Methods (supporting)

confidence: 84%

“…The convergence of biased gradient has been studied via a series of previous works (Schmidt et al, 2011; Bernstein et al, 2018; Hu et al, 2020; Ajalloeian and Stich, 2020; Scaman and Malherbe, 2020). We show a similar theorem below for the sake of completeness.…”

Section: Stochastic Gradient Descent With Biased Gradient (mentioning)

confidence: 99%

“…The relationship between gradient estimation and its final convergence has been widely studied in the optimization community. Since computing an approximated (and potentially biased) gradient is often more efficient than computing the exact gradient, many studies used approximated gradients to optimize their models and showed that they suffer from the biased estimation problem if there is no assumption on the gradient estimation (d'Aspremont, 2008; Schmidt et al, 2011; Bernstein et al, 2018; Hu et al, 2020; Ajalloeian and Stich, 2020).…”

Section: Introduction (mentioning)

confidence: 99%
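The citation statements above refer to the effect of a biased gradient estimate on convergence. As a rough illustration only, the sketch below runs SGD on the toy objective f(x) = 0.5 * x^2 with a gradient estimate that carries a constant bias plus Gaussian noise. The objective, the function name `biased_sgd`, and all parameter values are assumptions chosen for illustration; they are not taken from Hu et al. or from any of the citing papers.

```python
import random


def biased_sgd(x0, bias, noise_std, step_size, num_steps, seed=0):
    """Run SGD on f(x) = 0.5 * x**2 using a biased, noisy gradient estimate.

    The true gradient is x. The estimate used here is x + bias + noise,
    which is a hypothetical bias model chosen only for illustration.
    """
    rng = random.Random(seed)
    x = x0
    for _ in range(num_steps):
        grad_estimate = x + bias + rng.gauss(0.0, noise_std)
        x -= step_size * grad_estimate
    return x


# Without noise, the iterates settle where the *estimated* gradient vanishes,
# that is at x = -bias, rather than at the true minimizer x = 0.
x_biased = biased_sgd(x0=5.0, bias=0.3, noise_std=0.0, step_size=0.1, num_steps=500)
x_unbiased = biased_sgd(x0=5.0, bias=0.0, noise_std=0.0, step_size=0.1, num_steps=500)
print(x_biased, x_unbiased)
```

In this toy setting the true gradient at the limit point has magnitude equal to the bias, so the bias sets a floor on how close the iterates get to a stationary point. Setting `noise_std` above zero adds the usual variance-driven fluctuation around that shifted point.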


Learning Deep Neural Networks under Agnostic Corrupted Supervision

Liu, Sun, Wang et al. 2021. Preprint.


Training deep neural models in the presence of corrupted supervision is challenging as the corrupted data points may significantly impact the generalization performance. To alleviate this problem, we present an efficient robust algorithm that achieves strong guarantees without any assumption on the type of corruption, and provides a unified framework for both classification and regression problems. Unlike many existing approaches that quantify the quality of the data points (e.g., based on their individual loss values), and filter them accordingly, the proposed algorithm focuses on controlling the collective impact of data points on the average gradient. Even when a corrupted data point failed to be excluded by our algorithm, the data point will have very limited impact on the overall loss, as compared with state-of-the-art filtering methods based on loss values. Extensive experiments on multiple benchmark datasets have demonstrated the robustness of our algorithm under different types of corruptions.


“…To theoretically understand why GBML works well in practice, we shall comprehend the optimization properties of GBML with DNNs. Several recent works theoretically analyze GBML in the case of convex objectives [12,5,23,16,44]. However, DNNs are always non-convex, so these works do not directly apply to GBML with DNNs.…”

Section: Motivations (mentioning)

confidence: 99%

Global Convergence and Generalization Bound of Gradient-Based Meta-Learning with Deep Neural Nets

Haoxiang, Sun, Li. 2020. Preprint.


Gradient-based meta-learning (GBML) with deep neural nets (DNNs) has become a popular approach for few-shot learning. However, due to the non-convexity of DNNs and the complex bi-level optimization in GBML, the theoretical properties of GBML with DNNs remain largely unknown. In this paper, we first develop a novel theoretical analysis to answer the following questions: Does GBML with DNNs have global convergence guarantees? We provide a positive answer to this question by proving that GBML with over-parameterized DNNs is guaranteed to converge to global optima at a linear rate. The second question we aim to address is: How does GBML achieve fast adaption to new tasks with experience on past similar tasks? To answer it, we prove that GBML is equivalent to a functional gradient descent operation that explicitly propagates experience from the past tasks to new ones. Finally, inspired by our theoretical analysis, we develop a new kernel-based meta-learning approach. We show that the proposed approach outperforms GBML with standard DNNs on the Omniglot dataset when the number of past tasks for meta-training is small. The code is available at https://github.com/AI-secure/Meta-Neural-Kernel.


“…Connection to composite optimization. The proposed doubly variance reduction algorithm shares the same spirit with the variance reduced composite optimization problem considered in Zhang and Xiao (2019a); Hu et al (2020); Tran-Dinh et al (2020); Zhang and Xiao (2019b;c), but with two main differences. Firstly, the objective function is different.…”

mentioning

confidence: 99%

On the Importance of Sampling in Training GCNs: Tighter Analysis and Variance Reduction

Cong, Ramezani, Mahdavi. 2021. Preprint.


Graph Convolutional Networks (GCNs) have achieved impressive empirical advancement across a wide variety of graph-related applications. Despite their great success, training GCNs on large graphs suffers from computational and memory issues. A potential path to circumvent these obstacles is sampling-based methods, where at each layer a subset of nodes is sampled. Although recent studies have empirically demonstrated the effectiveness of sampling-based methods, these works lack theoretical convergence guarantees under realistic settings and cannot fully leverage the information of evolving parameters during optimization. In this paper, we describe and analyze a general doubly variance reduction schema that can accelerate any sampling method under the memory budget. The motivating impetus for the proposed schema is a careful analysis of the variance of sampling methods, where it is shown that the induced variance can be decomposed into node embedding approximation variance (zeroth-order variance) during forward propagation and layerwise-gradient variance (first-order variance) during backward propagation. We theoretically analyze the convergence of the proposed schema and show that it enjoys an O(1/T) convergence rate. We complement our theoretical results by integrating the proposed schema in different sampling methods and applying them to different large real-world graphs. Code is publicly available at https://github.com/CongWeilin/SGCN.git.

