2022
DOI: 10.1016/j.jco.2022.101646

A proof of convergence for gradient descent in the training of artificial neural networks for constant target functions


Citations: cited by 16 publications (10 citation statements)
References: 10 publications
“…Thus, in the experiment, we did not enhance data for initialization. Second, we chose four EfficientNet models (EfficientNetB3, EfficientNetB5, EfficientNetB6, and EfficientNetB7 [40]), considering that EfficientNet has been shown in numerous studies to have a greater advantage in image deep learning than DenseNet and ResNet [18]. EfficientNetB0 to EfficientNetB7 each comprise seven blocks.…”
Section: Datasets and Experimental Settings (mentioning)
confidence: 99%
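The quoted passage names the four backbones only in prose. The sketch below shows one way the four EfficientNet variants could be instantiated; the use of the Keras application models, ImageNet pre-trained weights, and average pooling is an assumption made here for illustration and is not stated in the quoted text.

```python
# Hypothetical sketch: instantiate the four EfficientNet variants named in the quote.
# Pre-trained ImageNet weights and average pooling are assumptions, not taken from the paper.
from tensorflow.keras.applications import (
    EfficientNetB3, EfficientNetB5, EfficientNetB6, EfficientNetB7,
)

backbones = {
    "B3": EfficientNetB3(weights="imagenet", include_top=False, pooling="avg"),
    "B5": EfficientNetB5(weights="imagenet", include_top=False, pooling="avg"),
    "B6": EfficientNetB6(weights="imagenet", include_top=False, pooling="avg"),
    "B7": EfficientNetB7(weights="imagenet", include_top=False, pooling="avg"),
}

for name, model in backbones.items():
    # Each backbone ends in a pooled feature vector whose width grows with model size.
    print(name, model.output_shape)
```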
“…Swish has an advantage in big data and deeper, more complex networks and thus performs more efficiently than other activations [17]. Proposed by Misra, Mish is also a nonmonotonic differentiable function, which is bounded below but has no upper bound [18]. Mish is smooth at every point of the curve, allowing more valid information to pass into the model and improving accuracy and generalization performance.…”
Section: Introduction (mentioning)
confidence: 99%
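As a minimal illustration of the two activations discussed in the quote (not code from the citing paper), the standard definitions Swish(x) = x * sigmoid(x) and Mish(x) = x * tanh(softplus(x)) are assumed below.

```python
import numpy as np

def swish(x):
    # Swish: x * sigmoid(x); smooth, non-monotonic, bounded below and unbounded above.
    return x / (1.0 + np.exp(-x))

def mish(x):
    # Mish: x * tanh(softplus(x)); also smooth and non-monotonic,
    # with a lower bound (about -0.31) and no upper bound.
    return x * np.tanh(np.log1p(np.exp(x)))

# Quick check of the qualitative behaviour described in the quote.
xs = np.linspace(-5.0, 5.0, 11)
print(np.round(swish(xs), 3))
print(np.round(mish(xs), 3))
```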
“…Moreover, non-global local minimum points could be found in the risk landscape of ANNs with one hidden layer and ReLU activation in special student-teacher setups with the probability distribution of the input data given by the normal distribution (see Safran & Shamir [31]). In other cases, where the target function has a very simple form, the critical points of the risk landscape are fully characterized and thus all local minimum points are known (see Cheridito et al [2,Corollary 2.15], Cheridito et al [3], and Jentzen & Riekert [17,Corollary 2.11]). Additionally, in the case of ANNs with linear activation and finitely many training data it was shown that all local minimum points of the risk function corresponding to the squared error loss are global minimum points (cf.…”
Section: Introduction (mentioning)
confidence: 99%
“…To describe a GF trajectory, we need to specify an appropriate generalized gradient function in Theorem 1.2, as the risk function is not differentiable in the case of ANNs with ReLU activation (due to the fact that the ReLU activation function $\mathbb{R} \ni x \mapsto \max\{x, 0\} \in \mathbb{R}$ fails to be differentiable at the origin). As in [17] (cf., e.g., also Cheridito et al [2]), we accomplish this by means of an approximation procedure in which the ReLU activation function $\mathbb{R} \ni x \mapsto \max\{x, 0\} \in \mathbb{R}$ is approximated by appropriate continuously differentiable functions whose derivatives converge pointwise to the left-derivative of the ReLU activation function; see (1.2) in Theorem 1.2. We now present the precise statement of Theorem 1.2.…”
Section: Introduction (mentioning)
confidence: 99%
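The approximation procedure in the quote is described only in words. The following is a minimal sketch of one possible family of continuously differentiable approximations with the stated properties; it is an illustrative choice and not necessarily the family used in [17].

```latex
% One hypothetical C^1 approximation of the ReLU function (illustration only;
% the family used in [17] may differ):
\[
  \mathcal{R}_r(x) =
  \begin{cases}
    0, & x \le 0, \\
    \tfrac{r}{2}\,x^2, & 0 < x < \tfrac{1}{r}, \\
    x - \tfrac{1}{2r}, & x \ge \tfrac{1}{r},
  \end{cases}
  \qquad
  \mathcal{R}_r'(x) =
  \begin{cases}
    0, & x \le 0, \\
    r x, & 0 < x < \tfrac{1}{r}, \\
    1, & x \ge \tfrac{1}{r}.
  \end{cases}
\]
% As r tends to infinity, \mathcal{R}_r(x) converges to \max\{x,0\} for every real x, and
% \mathcal{R}_r'(x) converges to \mathbb{1}_{(0,\infty)}(x), the left-derivative of the ReLU
% activation function (in particular, the limit is 0 at the origin).
```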