“…Additionally, $f^{(k)}(X;\theta)_i$ is the $i$-th row of $f^{(k)}(X;\theta)$, $m < n$ is the number of labeled nodes, and $D$ is some discriminator function, e.g., cross-entropy for classification or squared error for regression. We may then optimize this objective with respect to $\theta$. In prior work (Klicpera et al., 2018; Ma et al., 2020; Pan et al., 2021; Yang et al., 2021; Zhang et al., 2020; Zhu et al., 2021), this type of bilevel optimization framework has been adopted either to unify and explain existing GNN models or to motivate alternatives by varying the structure of $P^{(k)}$. However, in all cases that we are aware of to date, it has been assumed that $f(X;\theta)$ is differentiable, typically either a linear function or an MLP.…”
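As a rough illustration of the loss described in the passage above, the following sketch applies a discriminator D row by row to the labeled nodes only. It is not taken from the cited works. Several details are assumptions made for the example: the labeled nodes are taken to be the first m rows, the per-node terms are summed rather than averaged, and `outputs` stands in for the already-propagated predictions (the role played by f^(k)(X; θ)). The names `labeled_loss`, `squared_error`, and `cross_entropy` are hypothetical.

```python
import math


def squared_error(pred, target):
    # D for regression: squared Euclidean distance between two rows.
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def cross_entropy(logits, label):
    # D for classification: negative log-softmax of the true class,
    # computed with the max-shift trick for numerical stability.
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return log_z - logits[label]


def labeled_loss(outputs, targets, m, D):
    # Sum D over the first m rows only (assumption: labeled nodes come first).
    # Rows m..n-1 are unlabeled and contribute nothing to the loss.
    return sum(D(outputs[i], targets[i]) for i in range(m))


# Toy example: n = 3 nodes, of which m = 2 are labeled.
outputs = [[2.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

reg_targets = [[1.0, 0.0], [0.0, 1.0]]   # regression targets for labeled rows
class_targets = [0, 1]                   # class indices for labeled rows

reg_loss = labeled_loss(outputs, reg_targets, 2, squared_error)
clf_loss = labeled_loss(outputs, class_targets, 2, cross_entropy)
```

Only the first two rows enter either loss; the third row is a prediction for an unlabeled node. Any optimizer over θ would act on a quantity of this shape, which is where the assumption that f is differentiable becomes relevant for gradient-based training.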