A non-intrusive parallel-in-time approach for simultaneous optimization with unsteady PDEs

Günther, Stefanie; Gauger, Nicolas R.; Schroder, Jacob B.

doi:10.1080/10556788.2018.1504050

Cited by 21 publications

(13 citation statements)

References 36 publications


“…This is referred to as the cross-over point. However, the speedups observed can be large, e.g., the work [32] showed a speedup of 19x for a model optimization problem while using an additional 256 processors in time.…”

Section: Multigrid Across Layers For Forward Propagation (mentioning)

confidence: 99%

“…The MGRIT iterator has been shown to be a contraction in many settings for linear, nonlinear, parabolic, and hyperbolic problems, although hyperbolic problems tend to be more difficult (e.g., [17,19,32,21]). Upon convergence, the limit fixed-point U = MGRIT(A, U, θ, G) will satisfy the discrete network state equations as in (2.15)-(2.16), since MGRIT solves the same underlying problem.…”

Section: MGRIT Using Full Approximation Scheme (FAS) (mentioning)

confidence: 99%

“…They aim at solving the optimization problem in an all-at-once fashion, updating the optimization parameters simultaneously while solving for the time-dependent system state. Here, we apply the One-shot method [9,32] to solve the training problem simultaneously for the network state and parameters. In this approach, network parameter updates are based on inexact gradient information resulting from early stopping of the layer-parallel multigrid iteration.…”

Section: Introduction (mentioning)

confidence: 99%
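The simultaneous (One-shot) idea quoted above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the scalar fixed-point map G(u, θ) = 0.5·u + θ, the objective J(u) = 0.5·(u − 1)², the step size `alpha`, and the update ordering. In the cited work the state and adjoint iterations are multigrid (MGRIT) iterations, not this toy map.

```python
def one_shot(alpha=0.1, iterations=200):
    """Toy one-shot loop: one state, one adjoint, and one design update per pass."""
    u, u_bar, theta = 0.0, 0.0, 0.0
    for _ in range(iterations):
        # One step of the state fixed-point iteration u = G(u, theta) = 0.5*u + theta.
        u = 0.5 * u + theta
        # One step of the adjoint iteration: u_bar = dJ/du + (dG/du) * u_bar,
        # with the hypothetical objective J(u) = 0.5*(u - 1)^2.
        u_bar = (u - 1.0) + 0.5 * u_bar
        # Design update using the inexact reduced gradient (dG/dtheta) * u_bar = u_bar.
        theta -= alpha * u_bar
    return u, u_bar, theta


u, u_bar, theta = one_shot()
```

The gradient used in each design update is inexact because neither the state nor the adjoint has converged when it is evaluated. For this toy problem the three quantities still approach the optimum u = 1, θ = 0.5 with a vanishing adjoint.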


Layer-Parallel Training of Deep Residual Neural Networks

Günther, Ruthotto, Schroder et al. 2020

SIAM Journal on Mathematics of Data Science


Residual neural networks (ResNets) are a promising class of deep neural networks that have shown excellent performance for a number of learning tasks, e.g., image classification and recognition. Mathematically, ResNet architectures can be interpreted as forward Euler discretizations of a nonlinear initial value problem whose time-dependent control variables represent the weights of the neural network. Hence, training a ResNet can be cast as an optimal control problem of the associated dynamical system. For similar time-dependent optimal control problems arising in engineering applications, parallel-in-time methods have shown notable improvements in scalability. This paper demonstrates the use of those techniques for efficient and effective training of ResNets. The proposed algorithms replace the classical (sequential) forward and backward propagation through the network layers by a parallel nonlinear multigrid iteration applied to the layer domain. This adds a new dimension of parallelism across layers that is attractive when training very deep networks. From this basic idea, we derive multiple layer-parallel methods. The most efficient version employs a simultaneous optimization approach where updates to the network parameters are based on inexact gradient information in order to speed up the training process. Using numerical examples from supervised classification, we demonstrate that the new approach achieves similar training performance to traditional methods, but enables layer-parallelism and thus provides speedup over layer-serial methods through greater concurrency.

[…] in particular deep residual networks (ResNets) [36], have been breaking human records in various contests and are now central to technology such as image recognition [38,43,45] and natural language processing [6,15,41]. The abstract goal of machine learning is to model a function f : […] for input-output pairs (y, c) from a certain data set Y × C. Depending on the nature of inputs and outputs, the task can be regression or classification. When outputs are available for all samples, parts of the samples, or are not available, this formulation describes supervised, semi-supervised, and unsupervised learning, respectively. The function f can be thought of as an interpolation or approximation function.

In deep learning, the function f involves a DNN that aims at transforming the input data using many layers. The layers successively apply affine transformations and element-wise nonlinearities that are parametrized by the network parameters θ. The training problem consists of finding the parameters θ such that (1.1) is satisfied for data elements from a training data set, but also holds for previously unseen data from a validation data set, which has not been used during training. The former objective is commonly modeled as an expected loss and optimization techniques are used to find the parameters that minimize the loss.

Despite rapid methodological developments, compute times for training state-of-the-art DNNs can still be prohibitive, measured in the orde...
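The forward-Euler reading of a ResNet mentioned in the abstract can be sketched in a few lines. This is an illustration only: the function name `resnet_forward`, the tanh activation, the square weight matrices `K`, the biases `b`, and the step size `h` are all assumptions and are not taken from the paper.

```python
import math


def resnet_forward(u0, layers, h):
    """Apply u_{n+1} = u_n + h * tanh(K_n u_n + b_n) for each layer (K_n, b_n).

    Each pass through the loop is one forward Euler step of size h for the
    initial value problem du/dt = tanh(K(t) u + b(t)), u(0) = u0.
    """
    u = list(u0)
    for K, b in layers:
        dim = len(u)
        # Affine transformation K u + b, written out with plain lists.
        z = [sum(K[i][j] * u[j] for j in range(dim)) + b[i] for i in range(dim)]
        # Residual update: identity plus a scaled element-wise nonlinearity.
        u = [u[i] + h * math.tanh(z[i]) for i in range(dim)]
    return u


# Hypothetical example: 10 layers over the time interval [0, 1], so h = 0.1.
num_layers, final_time = 10, 1.0
layers = [([[0.0]], [0.5])] * num_layers
u_final = resnet_forward([1.0], layers, final_time / num_layers)
```

With zero weights and a constant bias the right-hand side does not depend on the state, so the Euler steps simply accumulate `final_time * tanh(0.5)` on top of the input. In this picture the layer index plays the role of time, which is the dimension the layer-parallel methods decompose.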


“…Aside from being used to accelerate direct studies, parallel-in-time methods have also been extended to optimization studies. In the work of Günther et al [26] and Günther et al [27] the XBraid library, which utilizes a multigrid reduction-in-time technique [14], is extended to accelerate optimization studies. Likewise, the PFASST algorithm [10] has also been used for PDE optimization [23,24].…”

Section: Introduction (mentioning)

confidence: 99%

A parallel-in-time approach for accelerating direct-adjoint studies

Skene, Eggl, Schmid 2021

Journal of Computational Physics


Parallel-in-time methods are developed to accelerate the direct-adjoint looping procedure. Particularly, we utilize the Paraexp algorithm, previously developed to integrate equations forward in time, to accelerate the direct-adjoint looping that arises from gradient-based optimization. We consider both linear and nonlinear governing equations and exploit the linear, time-varying nature of the adjoint equations. Gains in efficiency are seen across all cases, showing that a Paraexp based parallel-in-time approach is feasible for the acceleration of direct-adjoint studies. This signifies a possible approach to further increase the run-time performance for optimization studies that either cannot be parallelized in space or are at their limit of efficiency gains for a parallel-in-space approach.


“…While they report excellent speedups and linear scaling up to 50 processors and show convergence if sufficiently small step sizes for updating the control are used, it is unclear how to automatically select such a step size. Alternatively, space-time parallel multigrid methods are applied to adjoint gradient computation and simultaneous optimization [24,25] within the XBraid software library [4]. XBraid provides a non-intrusive framework adding time-parallelism to existing serial time stepping codes, and using simultaneous instead of reduced space optimization, a speedup of 19 using 256 time processors has been reported.…”

Section: Introduction (mentioning)

confidence: 99%

An Efficient Parallel-in-Time Method for Optimization with Parabolic PDEs

Götschel, Minion 2019

SIAM J. Sci. Comput.


To solve optimization problems with parabolic PDE constraints, often methods working on the reduced objective functional are used. They are computationally expensive due to the necessity of solving both the state equation and a backward-in-time adjoint equation to evaluate the reduced gradient in each iteration of the optimization method. In this study, we investigate the use of the parallel-in-time method PFASST in the setting of PDE-constrained optimization. In order to develop an efficient fully time-parallel algorithm we discuss different options for applying PFASST to adjoint gradient computation, including the possibility of doing PFASST iterations on both the state and adjoint equations simultaneously. We also explore the additional gains in efficiency from reusing information from previous optimization iterations when solving each equation. Numerical results for both a linear and a non-linear reaction-diffusion optimal control problem demonstrate the parallel speedup and efficiency of different approaches.
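To put the figure quoted in the citation statement above (a speedup of 19 using 256 time processors) next to the notion of parallel efficiency, the arithmetic can be written as a short sketch. The helper name `parallel_efficiency` is hypothetical, and the calculation assumes that only the time processors are counted and that the speedup is measured against a time-serial run.

```python
def parallel_efficiency(speedup, processors):
    """Return speedup per processor, the fraction of ideal linear scaling achieved."""
    return speedup / processors


# Reported figure: speedup of 19 with 256 processors in time.
efficiency = parallel_efficiency(19.0, 256)
print(f"parallel efficiency: {efficiency:.1%}")
```

Under these assumptions the efficiency is roughly 7%. That is consistent with the cross-over behaviour mentioned in the first citation statement, where parallel-in-time methods pay off only once enough processors are available.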
