We discuss how interpreting gradient descent in terms of a Moreau-Yosida-like regularization is particularly useful for studying gradient-based approaches to optimization. We show that many variants of GD that have been purposely developed or are widely used in Machine Learning, such as gradient descent with momentum, ADAGRAD, RMSprop, and ADAM, can be easily obtained and interpreted within this framework.
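As a brief illustration of the idea (our notation, not necessarily the speaker's): the Moreau-Yosida (proximal) view writes each parameter update as the minimizer of the loss plus a quadratic proximity term,

```latex
\theta_{k+1} \;=\; \arg\min_{\theta}\;\Big\{\, L(\theta) \;+\; \frac{1}{2\tau}\,\|\theta - \theta_k\|^2 \,\Big\},
```

and replacing $L$ by its first-order expansion $L(\theta_k) + \langle \nabla L(\theta_k), \theta - \theta_k\rangle$ recovers the explicit gradient step $\theta_{k+1} = \theta_k - \tau \nabla L(\theta_k)$. Modifying the proximity term (e.g. weighting it by past gradient statistics) is one route to the adaptive variants mentioned above.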
We push this idea further by proposing a variant of the Yosida regularization which makes use of model outputs to deform the notion of closeness in parameter space. In particular, we argue that our output-regularization approach can be employed to tackle forgetting in continual learning problems.
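To make the output-space idea concrete, here is a minimal sketch in the spirit described: instead of penalizing the parameter-space distance $\|\theta - \theta_k\|^2$, the proximal term penalizes how much the model's outputs change on inputs from earlier tasks. The linear model, function names, and hyperparameters are illustrative assumptions, not the speaker's actual formulation.

```python
import numpy as np

# Hypothetical linear model f(x; theta) = x @ theta; purely illustrative.
def model(theta, X):
    return X @ theta

def loss_grad(theta, X, y):
    # Gradient of the mean-squared-error loss 0.5 * mean((X@theta - y)^2).
    return X.T @ (model(theta, X) - y) / len(y)

def output_regularized_step(theta_k, X, y, X_old,
                            lam=1.0, n_inner=200, lr=0.01):
    """One proximal-style update solved by inner gradient descent:
    minimize  loss(theta; X, y)  +  (lam/2) * mean ||f(x; theta) - f(x; theta_k)||^2
    over old-task inputs X_old, i.e. closeness is measured in output space
    rather than parameter space."""
    old_outputs = model(theta_k, X_old)  # frozen targets from previous parameters
    theta = theta_k.copy()
    for _ in range(n_inner):
        g = loss_grad(theta, X, y)                      # new-task loss gradient
        g_prox = lam * X_old.T @ (model(theta, X_old)   # output-proximity gradient
                                  - old_outputs) / len(X_old)
        theta -= lr * (g + g_prox)
    return theta
```

With `lam > 0` the update is pulled toward parameters whose predictions on the old inputs stay close to the previous model's, which is the mechanism one would hope mitigates forgetting; with `lam = 0` it reduces to plain gradient descent on the new task.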
Join at: imt.lu/seminar