2020
DOI: 10.1016/j.neunet.2019.08.028

An analysis of training and generalization errors in shallow and deep networks

Abstract: This paper is motivated by an open problem around deep networks, namely, the apparent absence of overfitting despite large over-parametrization which allows perfect fitting of the training data. In this paper, we analyze this phenomenon in the case of regression problems when each unit evaluates a periodic activation function. We argue that the minimal expected value of the square loss is inappropriate to measure the generalization error in approximation of compositional functions in order to take full advanta…
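
To make the contrast explicit, the two error measures at issue can be written out; this is a standard formulation, not necessarily the paper's exact notation. For a target $f$, an approximant $P$, and a sampling distribution $\mu$,
\[
\mathbb{E}_{x\sim\mu}\big[(f(x)-P(x))^2\big] \;=\; \|f-P\|_{L^2(\mu)}^2,
\qquad
\|f-P\|_\infty \;=\; \sup_x \,|f(x)-P(x)|.
\]
For a compositional target $f = h \circ g$, uniform control of the inner approximation is what lets errors propagate through the composition, e.g.
\[
\|h\circ g - h_n\circ g_n\|_\infty \;\le\; \|h-h_n\|_\infty + \omega_h\big(\|g-g_n\|_\infty\big),
\]
where $\omega_h$ is a modulus of continuity of $h$; no analogous estimate follows from controlling only the $L^2(\mu)$ error of $g - g_n$. This is one way to read the abstract's claim that the expected square loss cannot take full advantage of the compositional structure.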


Cited by 14 publications (7 citation statements).
References 18 publications (32 reference statements).
“…In this range the Gaussian kernel is minimal at r = 3 and r = 4 for the binary and normalized versions, respectively; then, as the dimension increases, so does the classification accuracy, reaching 50% by r = 15 and 60% for q = 20. Interestingly, the performance of the Grassmann kernel stays constant for r ∈ [1, 20], with the normalized version maintaining an error of 15% and the binary version 18%. For r > 9 the Laplacian kernel obtained roughly the same performance, with the binary feature outperforming by 3%, and the Laplacian kernel outperformed the Grassmann kernel at larger values of r. We attribute this drop-off in accuracy for increasing r to the fact that, as the dimension of the manifold increases, the data points become more spread apart (less dense) and a poorer approximation of the labeling function is achieved.…”
Section: Sensitivity To Feature Dimension
confidence: 99%
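
The dimension sweep described above can be mimicked in a few lines. A minimal sketch, assuming synthetic data with a simple nonlinear labeling function and using scikit-learn's SVC with Gaussian (RBF) and Laplacian kernels as stand-ins; the cited paper's Grassmann kernel and its binary/normalized feature variants are not reproduced here.

# Sketch: test accuracy of Gaussian (RBF) vs. Laplacian kernel SVMs as the
# feature dimension r grows, with the sample size held fixed. Illustrative only.
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import laplacian_kernel
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

def make_data(r, n=400):
    # Labels come from a fixed nonlinear function of a 1-D projection of x.
    w = rng.normal(size=r)
    w /= np.linalg.norm(w)
    X = rng.normal(size=(n, r))
    y = (np.sin(3.0 * (X @ w)) > 0).astype(int)
    return X, y

for r in (2, 5, 10, 20):
    X, y = make_data(r)
    Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.5, random_state=0)

    rbf = SVC(kernel="rbf", gamma="scale").fit(Xtr, ytr)

    # Laplacian kernel supplied as precomputed Gram matrices.
    gamma = 1.0 / r
    Ktr = laplacian_kernel(Xtr, Xtr, gamma=gamma)
    Kte = laplacian_kernel(Xte, Xtr, gamma=gamma)
    lap = SVC(kernel="precomputed").fit(Ktr, ytr)

    print(f"r={r:2d}  RBF acc={rbf.score(Xte, yte):.2f}  "
          f"Laplacian acc={lap.score(Kte, yte):.2f}")

The printed numbers are illustrative only; the qualitative point is the one made in the quote: with the sample size held fixed, accuracy can depend strongly on the feature dimension r.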
“…In particular, it is shown in [23] that replacing the ReLU activation function with a polynomial approximation yields the same behavior as the original network. In [20] we analyzed the question from the point of view of approximation theory, so as to examine the intrinsic features of the data (rather than focusing on specific training algorithms) that allow this phenomenon. A crucial role in the proofs of the results in that paper is played by a highly localized kernel.…”
Section: Related Work
confidence: 99%
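
The substitution discussed in [23] can be illustrated on the activation function alone. A minimal sketch, assuming a plain degree-8 least-squares polynomial fit of ReLU on [-1, 1] and a small random one-layer network; this is not the construction used in [23], only an illustration of the idea.

# Sketch: approximate ReLU on [-1, 1] by a low-degree polynomial and compare
# the two activations inside a tiny random one-layer network.
import numpy as np

rng = np.random.default_rng(0)

def relu(t):
    return np.maximum(t, 0.0)

# Degree-8 least-squares polynomial fit to ReLU on a dense grid of [-1, 1].
grid = np.linspace(-1.0, 1.0, 2001)
coeffs = np.polynomial.polynomial.polyfit(grid, relu(grid), deg=8)

def poly_relu(t):
    return np.polynomial.polynomial.polyval(t, coeffs)

print("sup |ReLU - poly| on [-1, 1]:",
      np.max(np.abs(relu(grid) - poly_relu(grid))))

# One hidden layer with random weights, scaled so the pre-activations stay
# (roughly) inside [-1, 1], where the polynomial fit is valid.
W = rng.normal(size=(64, 4)) / 4.0
v = rng.normal(size=64) / 8.0
x = rng.uniform(-1.0, 1.0, size=(100, 4))
pre = x @ W.T

out_relu = relu(pre) @ v
out_poly = poly_relu(pre) @ v
print("max output difference:", np.max(np.abs(out_relu - out_poly)))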
“…• Our bounds are in the uniform norm. We have argued in [25] that the usual measurement of generalization error, via the expected value of the least-squares loss, is not applicable to approximation theory for deep networks; one has to use uniform approximation to take full advantage of the compositional structure.…”
Section: )
confidence: 99%
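
The distinction drawn in the quote above is easy to see numerically: an approximant can have a very small expected square loss while its uniform-norm error stays large on a narrow region. A minimal sketch; the target function, spike width, and uniform sampling are arbitrary choices made for illustration.

# Sketch: a small mean-square error does not control the uniform (sup-norm) error.
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
f = np.sin(2.0 * np.pi * x)              # target function

# Approximant equal to f except for a narrow spike of height 1 near x = 0.5.
width = 1e-3
P = f + np.where(np.abs(x - 0.5) < width, 1.0, 0.0)

mse = np.mean((f - P) ** 2)              # expected square loss under uniform sampling
sup = np.max(np.abs(f - P))              # uniform-norm error

print(f"mean-square error: {mse:.2e}")   # small, on the order of the spike width
print(f"uniform error:     {sup:.2f}")   # equal to the spike height

Under uniform sampling the spike contributes almost nothing to the mean-square error, yet the sup-norm error equals the spike height, and it is the latter quantity that the quoted compositional argument relies on.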