“…Previous literature finds that Adam is more vulnerable to sharp minima than SGD [65], which results in worse generalization [22,28,68]. Several subsequent works [10,52,69,76] propose more generalizable optimizers to address this problem. However, there can be a trade-off between generalization ability and convergence speed [19,38,48,69,76].…”