

# Gradient Descent Methods

This tour explores the use of gradient descent methods for unconstrained and constrained optimization of a smooth function.

## Installing toolboxes and setting up the path.

You need to download the following files: signal toolbox and general toolbox.

You need to unzip these toolboxes in your working directory, so that you have toolbox_signal and toolbox_general in your directory.

For Scilab users: you must replace the Matlab comment '%' by its Scilab counterpart '//'.

Recommendation: You should create a text file named for instance numericaltour.sce (in Scilab) or numericaltour.m (in Matlab) to write all the Scilab/Matlab commands you want to execute. Then, simply run exec('numericaltour.sce'); (in Scilab) or numericaltour; (in Matlab) to run the commands.

Execute this line only if you are using Matlab.

getd = @(p)path(p,path); % scilab users must *not* execute this


Then you can add the toolboxes to the path.

getd('toolbox_signal/');
getd('toolbox_general/');


## Gradient Descent for Unconstrained Problems

We consider the problem of finding a minimum of a function $$f$$, hence solving $\umin{x \in \RR^d} f(x)$ where $$f : \RR^d \rightarrow \RR$$ is a smooth function.

Note that the minimum is not necessarily unique. In the general case, $$f$$ might exhibit local minima, in which case the proposed algorithms are not expected to find a global minimizer of the problem. In this tour, we restrict our attention to convex functions, so that the methods will converge to a global minimizer.

The simplest method is the gradient descent, that computes $x^{(k+1)} = x^{(k)} - \tau_k \nabla f(x^{(k)}),$ where $$\tau_k>0$$ is a step size, and $$\nabla f(x) \in \RR^d$$ is the gradient of $$f$$ at the point $$x$$, and $$x^{(0)} \in \RR^d$$ is any initial point.

In the convex case, if $$f$$ is of class $$C^2$$, in order to ensure convergence, the step size should satisfy $0 < \tau_k < \frac{2}{ \sup_x \norm{Hf(x)} }$ where $$Hf(x) \in \RR^{d \times d}$$ is the Hessian of $$f$$ at $$x$$ and $$\norm{\cdot}$$ is the spectral operator norm (largest eigenvalue).
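
As an informal illustration (not part of the original code), the iteration can be written as a short Matlab loop; here the function handles f and Gradf, the step size tau and the initial point x are assumed to be defined, as they will be in the examples below.

% generic fixed-step gradient descent (f, Gradf, tau and x assumed defined)
niter = 100;
for k=1:niter
    x = x - tau*Gradf(x); % gradient step x^(k+1) = x^(k) - tau*Gradf(x^(k))
end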

## Gradient Descent in 2-D

We consider a simple problem, corresponding to the minimization of a 2-D quadratic form $f(x) = \frac{1}{2} \pa{ x_1^2 + \eta x_2^2 }$ where $$\eta>0$$ controls the anisotropy, and hence the difficulty, of the problem.

Anisotropy parameter $$\eta$$.

eta = 10;


Function $$f$$.

f = @(x)( x(1)^2 + eta*x(2)^2 ) /2;


Background image of the function.

t = linspace(-.7,.7,101);
[u,v] = meshgrid(t,t);
F = ( u.^2 + eta*v.^2 )/2 ;


Display the function as a 2-D image.

clf; hold on;
imagesc(t,t,F); colormap jet(256);
contour(t,t,F, 20, 'k');
axis off; axis equal;


Gradient.

Gradf = @(x)[x(1); eta*x(2)];


The step size should satisfy $$\tau_k < 2/\eta$$. We use here a constant step size.

tau = 1.8/eta;


Exercise 1: (check the solution) Perform the gradient descent using a fixed step size $$\tau_k=\tau$$. Display the decay of the energy $$f(x^{(k)})$$ through the iterations. Save the iterates so that X(:,k) corresponds to $$x^{(k)}$$.

exo1;
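
A possible implementation of this exercise is sketched below; the initial point and number of iterations are assumptions, and the provided exo1 script may differ.

% possible sketch for Exercise 1 (initial point and niter are assumptions)
niter = 20;
x = [.6; .6];                  % initial point x^(0)
X = zeros(2,niter); E = zeros(niter,1);
for k=1:niter
    X(:,k) = x; E(k) = f(x);   % store iterate and energy
    x = x - tau*Gradf(x);      % fixed-step gradient update
end
clf; plot(E, '.-'); axis tight;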


Display the iterations.

clf; hold on;
imagesc(t,t,F); colormap jet(256);
contour(t,t,F, 20, 'k');
h = plot(X(1,:), X(2,:), 'k.-');
set(h, 'LineWidth', 2);
set(h, 'MarkerSize', 15);
axis off; axis equal;


Exercise 2: (check the solution) Display the iterations for several different step sizes.

exo2;
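
One possible way to carry out this exercise is to loop over a few step sizes and overlay the resulting trajectories; the list of step sizes below is an arbitrary choice, and the provided exo2 script may differ.

% possible sketch for Exercise 2 (step sizes chosen arbitrarily)
tau_list = [.3 1 1.7 1.9]/eta;
clf; hold on;
imagesc(t,t,F); colormap jet(256);
contour(t,t,F, 20, 'k');
for it=1:length(tau_list)
    x = [.6; .6]; X = x;           % assumed initial point
    for k=1:20
        x = x - tau_list(it)*Gradf(x);
        X(:,end+1) = x;
    end
    plot(X(1,:), X(2,:), '.-');
end
axis off; axis equal;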


## Gradient and Divergence of Images

Local differential operators like gradient, divergence and laplacian are the building blocks for variational image processing.

Load an image $$x_0 \in \RR^N$$ of $$N=n \times n$$ pixels.

n = 256;
x0 = rescale( load_image('lena',n) );


Display it.

clf;
imageplot(x0);


For a continuous function $$g$$, the gradient reads $\nabla g(s) = \pa{ \pd{g(s)}{s_1}, \pd{g(s)}{s_2} } \in \RR^2.$ (note that here, the variable $$s$$ denotes the 2-D spatial position).

We discretize this differential operator on a discrete image $$x \in \RR^N$$ using first order finite differences. $(\nabla x)_i = ( x_{i_1,i_2}-x_{i_1-1,i_2}, x_{i_1,i_2}-x_{i_1,i_2-1} ) \in \RR^2.$ Note that for simplicity we use periodic boundary conditions.

Compute its gradient, using finite differences.

grad = @(x)cat(3, x-x([end 1:end-1],:), x-x(:,[end 1:end-1]));


One thus has $$\nabla : \RR^N \mapsto \RR^{N \times 2}.$$

v = grad(x0);


One can display each of its components.

clf;
imageplot(v(:,:,1), 'd/dx', 1,2,1);
imageplot(v(:,:,2), 'd/dy', 1,2,2);


One can also display it using a color image.

clf;
imageplot(v);


One can display its magnitude $$\norm{(\nabla x)_i}$$, which is large near edges.

clf;
imageplot( sqrt( sum3(v.^2,3) ) );


The divergence operator maps vector fields to images. For continuous vector fields $$v(s) \in \RR^2$$, it is defined as $\text{div}(v)(s) = \pd{v_1(s)}{s_1} + \pd{v_2(s)}{s_2} \in \RR.$ (note that here, the variable $$s$$ denotes the 2-D spatial position). It is minus the adjoint of the gradient, i.e. $$\text{div} = - \nabla^*$$.

It is discretized, for $$v=(v^1,v^2)$$ as $\text{div}(v)_i = v^1_{i_1+1,i_2} - v^1_{i_1,i_2} + v^2_{i_1,i_2+1} - v^2_{i_1,i_2} .$

div = @(v)v([2:end 1],:,1)-v(:,:,1) + v(:,[2:end 1],2)-v(:,:,2);
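
As a sanity check (not in the original code), one can verify numerically on random inputs that div is indeed minus the adjoint of grad, i.e. that $$\dotp{\nabla u}{w} + \dotp{u}{\text{div}(w)} = 0$$.

% numerical check that div = -grad^* (random test inputs)
dotp = @(a,b)sum(a(:).*b(:));
u = randn(n); w = randn(n,n,2);
fprintf('Should be ~0: %.2e\n', dotp(grad(u),w) + dotp(u,div(w)) );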


The Laplacian operator is defined as $$\Delta=\text{div} \circ \nabla = -\nabla^* \circ \nabla$$. It is thus a symmetric negative semi-definite operator.

delta = @(x)div(grad(x));


Display $$\Delta x_0$$.

clf;
imageplot(delta(x0));


Check numerically the relation $$\norm{\nabla x}^2 = - \dotp{\Delta x}{x}$$.

dotp = @(a,b)sum(a(:).*b(:));
fprintf('Should be 0: %.3i\n', dotp(grad(x0), grad(x0)) + dotp(delta(x0),x0) );

Should be 0: 000


## Gradient Descent in Image Processing

We consider now the problem of denoising an image $$y \in \RR^d$$ where $$d = n \times n$$ is the number of pixels ($$n$$ being the number of rows/columns in the image).

Add noise to the original image, to simulate a noisy image.

sigma = .1;
y = x0 + randn(n)*sigma;


Display the noisy image $$y$$.

clf;
imageplot(clamp(y));


Denoising is obtained by minimizing the following functional $\umin{x \in \RR^d} f(x) = \frac{1}{2} \norm{y-x}^2 + \lambda J_\epsilon(x)$ where $$J_\epsilon(x)$$ is a smoothed total variation of the image. $J_\epsilon(x) = \sum_i \norm{ (G x)_i }_{\epsilon}$ where $$(Gx)_i \in \RR^2$$ is an approximation of the gradient of $$x$$ at pixel $$i$$ and for $$u \in \RR^2$$, we use the following smoothing of the $$L^2$$ norm in $$\RR^2$$ $\norm{u}_\epsilon = \sqrt{ \epsilon^2 + \norm{u}^2 },$ for a small value of $$\epsilon>0$$.

The gradient of the functional reads $\nabla f(x) = x-y + \lambda \nabla J_\epsilon(x)$ where the gradient of the smoothed TV norm is $\nabla J_\epsilon(x) = G^*( u ) \qwhereq u_i = \frac{ (G x)_i }{\norm{ (G x)_i }_\epsilon}$ where $$G^*$$ is the adjoint operator of $$G$$, which corresponds to minus a discretized divergence.

Value for $$\lambda$$.

lambda = .3/5;


Value for $$\epsilon$$.

epsilon = 1e-3;


TV norm.

NormEps = @(u,epsilon)sqrt(epsilon^2 + sum(u.^2,3));
J = @(x,epsilon)sum(sum(NormEps(grad(x),epsilon)));


Function $$f$$ to minimize.

f = @(y,x,epsilon)1/2*norm(x-y,'fro')^2 + lambda*J(x,epsilon);


Gradient of $$J_\epsilon$$. Note that div implements $$-G^*$$.

Normalize = @(u,epsilon)u./repmat(NormEps(u,epsilon), [1 1 2]);
GradJ = @(x,epsilon)-div( Normalize(grad(x),epsilon) );


Gradient of the functional.

Gradf = @(y,x,epsilon)x-y+lambda*GradJ(x,epsilon);


The step size should satisfy $0 < \tau_k < \frac{2}{ 1 + 8 \lambda / \epsilon },$ since $$\norm{G}^2 \leq 8$$ for the finite difference gradient. Here we use a slightly larger step size, which still works in practice.

tau = 1.8/( 1 + lambda*8/epsilon );
tau = tau*4;


Exercise 3: (check the solution) Implement the gradient descent. Monitor the decay of $$f$$ through the iterations.

exo3;
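
A minimal sketch of what this exercise could look like is given below; initialization with the noisy image and the number of iterations are assumptions, and the provided exo3 script may differ.

% possible sketch for Exercise 3 (initialization and niter are assumptions)
niter = 200;
x = y;                               % start from the noisy image
E = zeros(niter,1);
for k=1:niter
    E(k) = f(y,x,epsilon);           % record the energy
    x = x - tau*Gradf(y,x,epsilon);  % gradient step on the denoising energy
end
clf; plot(E); axis tight;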


Display the resulting denoised image.

clf;
imageplot(clamp(x));


## Constrained Optimization Using Projected Gradient Descent

We consider a linear imaging operator $$\Phi : x \mapsto \Phi(x)$$ that maps high resolution images to low dimensional observations. Here we consider a pixel masking operator, that is diagonal over the spatial domain.

To emphasize the effect of the TV functional, we use a simple geometric image.

n = 64;
name = 'square';
x0 = load_image(name,n);


We consider here the inpainting problem. This simply corresponds to a masking operator. Here we remove the central part of the image.

a = 4;
Lambda = ones(n);
Lambda(end/2-a:end/2+a,:) = 0;


Masking operator $$\Phi$$. Note that it is symmetric, i.e. $$\Phi^*=\Phi$$.

Phi  = @(x)x.*Lambda;
PhiS = @(x)Phi(x);


Noiseless observations $$y=\Phi x_0$$.

y = Phi(x0);


Display.

clf;
imageplot(x0, 'Original', 1,2,1);
imageplot(y, 'Damaged', 1,2,2);


We want to solve the noiseless inverse problem $$y=\Phi x$$ using a total variation regularization: $\umin{ y=\Phi x } J_\epsilon(x).$ We use the following projected gradient descent $x^{(k+1)} = \text{Proj}_{\Hh}( x^{(k)} - \tau_k \nabla J_{\epsilon}(x^{(k)}) )$ where $$\text{Proj}_{\Hh}$$ is the orthogonal projection onto the set $$\Hh$$ of images satisfying the linear constraint $$\Phi x = y$$, which is easy to compute for inpainting:

ProjH = @(x,y) x + PhiS( y - Phi(x) );


Exercise 4: (check the solution) Display the evolution of the inpainting process.

exo4;
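
A sketch of the loop structure for this exercise is given below; the step size and number of iterations are assumptions that would need tuning (a small $$\epsilon$$ calls for a small step and many iterations), and the provided exo4 script, which also displays intermediate iterates, may differ.

% possible sketch for Exercise 4 (tau and niter are assumptions)
niter = 1000;
tau = 1.8*epsilon/8;                 % below the bound 2/(8/epsilon) for J_epsilon alone
x = y;                               % start from the observations
for k=1:niter
    x = ProjH( x - tau*GradJ(x,epsilon), y ); % projected gradient step
end
clf;
imageplot(clamp(x));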


Exercise 5: (check the solution) Try with several values of $$\epsilon$$.

exo5;
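
One way to carry out this last exercise is to re-run the projected gradient descent for a few values of $$\epsilon$$ and compare the results; the values, step size and iteration count below are arbitrary choices, and the provided exo5 script may differ.

% possible sketch for Exercise 5 (epsilon values and niter chosen arbitrarily)
epsilon_list = [1e-1 1e-2 1e-3];
clf;
for ie=1:length(epsilon_list)
    epsilon = epsilon_list(ie);
    x = y;                           % start from the observations
    for k=1:1000
        x = ProjH( x - 1.8*epsilon/8*GradJ(x,epsilon), y );
    end
    imageplot(clamp(x), ['epsilon=' num2str(epsilon)], 1,3,ie);
end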