2021
DOI: 10.48550/arxiv.2111.05803
Preprint

Gradients are Not All You Need

Abstract: Differentiable programming techniques are widely used in the community and are responsible for the machine learning renaissance of the past several decades. While these methods are powerful, they have limits. In this short report, we discuss a common chaos based failure mode which appears in a variety of differentiable circumstances, ranging from recurrent neural networks and numerical physics simulation to training learned optimizers. We trace this failure to the spectrum of the Jacobian of the system under s…
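To make the failure mode concrete, here is a minimal sketch (ours, not the paper's code) that unrolls the logistic map in JAX at a parameter value where the dynamics are chaotic. The gradient of the final state with respect to the initial condition is a product of per-step Jacobians whose magnitudes typically exceed one, so it grows exponentially with rollout length; the initial value 0.3, the parameter r = 3.9, and the step counts are arbitrary choices for illustration.

```python
# Minimal illustration (not from the paper) of the chaos-based failure mode:
# gradients through a long unroll of a chaotic map explode, because the
# chain rule multiplies per-step Jacobians whose magnitudes exceed 1.
import jax
import jax.numpy as jnp

def rollout(x0, r, steps):
    # Iterate the logistic map x_{t+1} = r * x_t * (1 - x_t);
    # the dynamics are chaotic for r = 3.9.
    x = x0
    for _ in range(steps):
        x = r * x * (1.0 - x)
    return x

for steps in (10, 25, 50, 100):
    # d x_T / d x_0 via reverse-mode autodiff through the unrolled loop.
    g = jax.grad(rollout)(0.3, 3.9, steps)
    print(f"steps={steps:4d}  |dx_T/dx_0| = {abs(float(g)):.3e}")
```

Longer unrolls only make the magnitude larger, which is the regime where unrolled gradients stop carrying useful optimization signal.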

Cited by 20 publications (31 citation statements). References 25 publications.
“…We used a progressively larger rollout for each round of training: 4, 8, and 12-step losses, corresponding to 1, 2, and 3-day rollouts, for the three rounds of training. Using even larger rollouts is enticing, but there are probably diminishing returns [Metz et al., 2021], and in practice we obtained only slightly worse results when using a 4-step loss throughout.…”
Section: Multi-step Loss
Mentioning; confidence: 77%
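As background for the rollout losses this excerpt describes, the following is a hedged sketch of the general pattern, not the citing paper's code; `step_fn`, the parameter pytree, and the array shapes are hypothetical placeholders. A one-step model is applied autoregressively and the loss averages the error at each intermediate step, so increasing the rollout length (e.g. 4, 8, then 12 steps) simply means differentiating through more applications of the model.

```python
# Hedged sketch of a multi-step rollout loss; names and shapes are
# hypothetical, not taken from the citing paper.
import jax
import jax.numpy as jnp

def multi_step_loss(params, step_fn, x0, targets, num_steps):
    """Mean per-step MSE over a rollout of length num_steps."""
    x = x0
    loss = 0.0
    for t in range(num_steps):
        x = step_fn(params, x)                  # one autoregressive step
        loss = loss + jnp.mean((x - targets[t]) ** 2)
    return loss / num_steps

# Toy stand-in dynamics model so the sketch runs end to end.
def step_fn(params, x):
    return params["A"] @ x + params["b"]

params = {"A": 0.9 * jnp.eye(3), "b": jnp.zeros(3)}
x0 = jnp.ones(3)
targets = jnp.zeros((4, 3))                     # targets for a 4-step rollout
grads = jax.grad(multi_step_loss)(params, step_fn, x0, targets, num_steps=4)
```

Because gradients flow through every step of the rollout, longer horizons are where the diminishing returns, and, for chaotic dynamics, the exploding gradients discussed in Metz et al. (2021), show up.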
“…As an aside we note that, while it may be tempting to replace this heuristic with a more end-to-end-learned approach, (i) you would still have to use human judgement to pick a metric to optimize (e.g. globe-averaged Z500 at a 10-day forecast horizon) and (ii) directly optimizing over tens of rollout steps might not be effective [Metz et al., 2021], even if you are able to fit the gradient into GPU memory.…”
Section: Discussion
Mentioning; confidence: 99%
“…We quite robustly addressed this by isolating nondimensionalized functions that were carefully implemented to obtain the correct asymptotic result in all cases. For more discussion of both numerical and analytical issues with gradients, we refer the interested reader to [Johnson and Fedkiw 2022; Metz et al. 2021].…”
Section: Discussion
Mentioning; confidence: 99%
“…But much of this progress is restricted to systems that rely on gradient descent, a highly effective optimization method when we provide it with a well-defined, differentiable objective function. But in areas such as artificial life, complex systems, computational biology, and even classical physics [17], much of the interesting behavior we observe takes place near the chaotic states, where a system is constantly transitioning between order and disorder. It can be argued that intelligent life and even civilization are all complex systems operating at the edge of chaos [3,15].…”
Section: Introduction
Mentioning; confidence: 99%