“…Rather than standard tensor notation, we employ matrix-vector product descriptions to stay close to the PDE-constrained optimization literature. We block-vectorize states and Lagrangian multipliers, while weight/parameter tensors are flattened into block matrices; see [21,18,4]. To keep the notation compact, we focus on the ResNet [10] with time step $h$. The network state $y_j$ at layer $j$ is given by $y_j = y_{j-1} - h f(K_j y_{j-1})$.…”
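
The layer recursion quoted above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `matvec` and `resnet_forward`, the list-of-rows matrix representation, and the default choice of `tanh` for the activation $f$ are all assumptions made for illustration, since the excerpt does not specify $f$.

```python
import math


def matvec(K, y):
    """Matrix-vector product K y, with K given as a list of rows."""
    return [sum(k * v for k, v in zip(row, y)) for row in K]


def resnet_forward(y0, weights, h, f=math.tanh):
    """Apply y_j = y_{j-1} - h * f(K_j y_{j-1}) for each K_j in `weights`.

    `f` is applied elementwise; tanh is an assumed placeholder activation.
    Returns all states [y_0, y_1, ..., y_N] so they can be reused later,
    for example in an adjoint (Lagrange multiplier) computation.
    """
    states = [list(y0)]
    for K in weights:
        y_prev = states[-1]
        z = matvec(K, y_prev)  # K_j y_{j-1}
        states.append([y - h * f(zi) for y, zi in zip(y_prev, z)])
    return states
```

For example, `resnet_forward([2.0], [[[1.0]], [[1.0]]], h=0.5, f=lambda x: x)` propagates a scalar state through two layers with an identity activation. Each $K_j$ must be square here, because the update subtracts $h f(K_j y_{j-1})$ from $y_{j-1}$ and the two vectors must have the same length.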