Do you know about the less famous sibling of the gradient descent step?
It’s a proximal step, and here’s how it works.
Suppose you have a point z and want to take a step toward minimizing a function f(x). You can solve a mini-optimization problem that combines f(x) with a penalty for how far you move from z. This is known as a proximal operation. Specifically, the proximal operator solves
min_x  α·f(x) + ½‖x − z‖²,
where α is the step size. The solution for various choices of α is illustrated in the animation above. The x-coordinate of the circle at the minimum of the white curve gives the proximal step.
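To make that subproblem concrete, here is a minimal numerical sketch (the helper name prox and the use of scipy's general-purpose minimizer are my own choices for illustration, not part of any prox library):

```python
import numpy as np
from scipy.optimize import minimize

def prox(f, z, alpha):
    """Numerically evaluate prox_{alpha*f}(z) = argmin_x alpha*f(x) + 0.5*||x - z||^2."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    objective = lambda x: alpha * f(x) + 0.5 * np.sum((x - z) ** 2)
    # Warm-start the inner solve at z itself; for small alpha the answer stays nearby.
    return minimize(objective, x0=z.copy()).x

# Example: f(x) = |x| at z = 2.0 with alpha = 0.5.
# The closed-form answer is soft-thresholding: sign(z) * max(|z| - alpha, 0) = 1.5.
f = lambda x: np.sum(np.abs(x))
print(prox(f, 2.0, 0.5))   # approximately [1.5]
```

In practice you would rarely call a general solver like this; the point of the closed-form cases below is to avoid it.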
Although evaluating proximal steps requires solving a mini-optimization problem, in many important instances these subproblems have explicit formulas, even if the function itself is not smooth.
For an L1 norm, it’s a soft shrink.
For a hard constraint, it’s a projection.
For a quadratic, it’s an average.
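For reference, here are those three closed-form proxes as short Python sketches (the helper names are mine; I take the hard constraint to be a box constraint and the quadratic to be ½‖x − b‖², just to keep the formulas simple):

```python
import numpy as np

def prox_l1(z, alpha):
    """prox of f(x) = ||x||_1: soft-thresholding (the soft shrink)."""
    return np.sign(z) * np.maximum(np.abs(z) - alpha, 0.0)

def prox_box(z, lo, hi):
    """prox of the indicator of the box [lo, hi]: projection onto the box.
    (The step size alpha is irrelevant for a hard constraint.)"""
    return np.clip(z, lo, hi)

def prox_quadratic(z, alpha, b):
    """prox of f(x) = 0.5 * ||x - b||^2: a weighted average of z and b."""
    return (z + alpha * b) / (1.0 + alpha)
```

Each one costs a single elementwise pass over the vector, which is why they stay cheap in high dimensions.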
When solving high-dimensional problems, these simple, cheap-to-evaluate formulas can be quite useful. Often, the prox is just one piece of the update in a more sophisticated splitting algorithm (e.g. Davis-Yin splitting, ADMM, or Douglas-Rachford splitting).
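As a deliberately simple illustration of a prox sitting inside a larger update (simpler than the splitting methods named above), here is a sketch of proximal gradient descent for the lasso problem min_x ½‖Ax − b‖² + λ‖x‖₁; the data A, b and the parameter choices are made up for the example:

```python
import numpy as np

def proximal_gradient_lasso(A, b, lam, step, iters=500):
    """Minimize 0.5*||Ax - b||^2 + lam*||x||_1 by alternating a gradient step
    on the smooth term with the L1 prox (soft-thresholding)."""
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        grad = A.T @ (A @ x - b)                                  # gradient of the smooth part
        z = x - step * grad                                       # plain gradient step
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)  # proximal step on lam*||x||_1
    return x

# Tiny synthetic example; step is kept below 1 / ||A||_2^2 for convergence.
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 20))
b = A @ np.concatenate([np.ones(3), np.zeros(17)]) + 0.01 * rng.standard_normal(50)
step = 1.0 / np.linalg.norm(A, 2) ** 2
print(proximal_gradient_lasso(A, b, lam=0.1, step=step))
```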
If you'd like to learn more, I recommend Lieven Vandenberghe's slides:
Cheers,
Howard
p.s. For readers familiar with ODEs, gradient steps can be viewed as Forward Euler and proximal steps as Backward Euler discretizations of the gradient flow x'(t) = −∇f(x(t)).
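To spell that correspondence out (a quick sketch, assuming f is differentiable so the prox optimality condition can be written with a gradient):

```latex
\[
\text{forward Euler:}\quad x_{k+1} = x_k - \alpha \nabla f(x_k)
\qquad
\text{backward Euler:}\quad x_{k+1} = x_k - \alpha \nabla f(x_{k+1})
\]
\[
x_{k+1} = x_k - \alpha \nabla f(x_{k+1})
\;\Longleftrightarrow\;
\nabla\Big(\alpha f(x) + \tfrac{1}{2}\|x - x_k\|^2\Big)\Big|_{x = x_{k+1}} = 0
\;\Longleftrightarrow\;
x_{k+1} = \mathrm{prox}_{\alpha f}(x_k),
\]
so a backward Euler step on the gradient flow is exactly a proximal step.
```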