2022
DOI: 10.1007/s00033-022-01716-w

A proof of convergence for stochastic gradient descent in the training of artificial neural networks with ReLU activation for constant target functions

Abstract: In this article we study the stochastic gradient descent (SGD) optimization method in the training of fully connected feedforward artificial neural networks with ReLU activation. The main result of this work proves that the risk of the SGD process converges to zero if the target function under consideration is constant. In the established convergence result the considered artificial neural networks consist of one input layer, one hidden layer, and one output layer (with $d \in \mathbb{N}$ …
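The setting described in the abstract lends itself to a small numerical illustration. The sketch below is a hedged demonstration, not the paper's construction: it trains a one-hidden-layer ReLU network with plain SGD on the constant target f*(x) = 1, and the width, step size, batch size, and uniform sampling distribution are all assumptions chosen for illustration. The paper proves that the risk converges to zero in this setting; the code merely shows the empirical risk decaying.

```python
import numpy as np

# Hypothetical sketch of the paper's setting: a fully connected one-hidden-layer
# ReLU network trained by plain SGD on the constant target f*(x) = 1 over [0,1]^d.
# Width, step size, batch size, and sampling are illustrative assumptions.

rng = np.random.default_rng(0)
d, width = 2, 16                    # input dimension and hidden width
lr, steps, batch = 0.05, 2001, 32

W = rng.normal(size=(width, d))     # hidden-layer weights
b = rng.normal(size=width)          # hidden-layer biases
v = rng.normal(size=width) / np.sqrt(width)  # output weights
c = 0.0                             # output bias

for t in range(steps):
    X = rng.uniform(size=(batch, d))        # i.i.d. uniform samples on [0,1]^d
    y = np.ones(batch)                      # constant target f*(x) = 1
    Z = X @ W.T + b                         # pre-activations, shape (batch, width)
    H = np.maximum(Z, 0.0)                  # ReLU
    err = H @ v + c - y                     # residuals of the squared loss
    # Gradients of the empirical risk 0.5 * mean(err**2)
    G = (err[:, None] * v) * (Z > 0.0)      # back-propagate through the ReLU
    W -= lr * (G.T @ X) / batch
    b -= lr * G.mean(axis=0)
    v -= lr * (H.T @ err) / batch
    c -= lr * err.mean()
    if t % 500 == 0:
        print(f"step {t:4d}   empirical risk {0.5 * (err**2).mean():.6f}")
```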

Cited by 7 publications (5 citation statements) · References 32 publications
“…The previous paper [24] contains comparable results for 1d shallow networks. Similar approximation results for gradient flow trained shallow 1d networks are in [33,31], with slightly different assumptions on the target f, more general probability weighted L^2 loss and an alternative proof technique. Other approximation and optimization guarantees rely on alternative optimizers.…”
Section: Literature Review (mentioning)
confidence: 55%
“…Due to the over-parametrized regime, these optimization results achieve zero training error in discrete sample norms and are therefore not immediately compatible with the approximation literature. There are relatively few papers [1,21,42,15,24,26,30,23,45] that consider approximation and optimization simultaneously.…”
Section: Introduction (mentioning)
confidence: 99%
“…The previous paper [24] contains comparable results for 1d shallow networks. Similar approximation results for gradient flow trained shallow 1d networks are in [30,32], with slightly different assumptions on the target f, more general probability weighted L^2 loss and an alternative proof technique. Other approximation and optimization guarantees rely on alternative optimizers.…”
Section: Literature Review (mentioning)
confidence: 62%
“…• Gradient descent or gradient flow error bounds in continuous L^2 norms can be found in [30,32], and [17,37]. The first set of papers uses more general L^2(P) losses, weighted by a probability measure P of the training samples. For deep networks, they show that the loss converges to zero if the learning target f is piecewise polynomial, and for shallow networks if the target is an increasing function.…”
Section: Literature Review (mentioning)
confidence: 99%
“…The way the validation set is divided may result in a large variance in the validation scores. In this case, the best practice is to use the K-fold cross-validation method [13]. This method divides the available data into K partitions (K is usually 4 or 5) and instantiates K identical models.…”
Section: Model Verification (mentioning)
confidence: 99%
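A minimal sketch of the K-fold procedure described in the statement above, assuming scikit-learn is available; the toy data, the Ridge model, and the MSE score are placeholders standing in for whatever model and metric the cited work actually validates:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

# Toy regression data; in practice X and y come from the problem at hand.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=200)

K = 5                                     # K is usually 4 or 5, as noted above
scores = []
for train_idx, val_idx in KFold(n_splits=K, shuffle=True, random_state=0).split(X):
    model = Ridge()                       # a fresh, identical model for each fold
    model.fit(X[train_idx], y[train_idx])
    scores.append(mean_squared_error(y[val_idx], model.predict(X[val_idx])))

# Averaging over the K folds reduces the variance caused by any single split.
print(f"mean validation MSE over {K} folds: {np.mean(scores):.4f}")
```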