Gradient descent and stochastic gradient descent

Author

Ryan Giordano

Goals

  • Introduce methods for optimizing empirical risk in practice
    • Gradient descent and stochastic gradient descent
    • Behavior on quadratic objectives, relationship between step size and eigenvalues
    • Automatic differentiation
  • Learn what “back–propagation” actually means

\[ \def\gv{\boldsymbol{g}} \def\H{\boldsymbol{H}} \def\thetavstar{\boldsymbol{\theta}^{\star}} \]

Reading

The primary reading for this section is Chapter 5 of Hardt and Recht (2021). These notes are only a list of the topics we will discuss in lecture, and function primarily as a reminder to the lecturer of what to cover. Students are advised to just read the textbook.

Classes of objective functions

  • A hierarchy of hardness:
    • A single global minimum exists
    • Many connected global minima exist
    • Many disconnected, nearly global minima exist
    • Many disconnected local minima exist

The most robust theory is for convex functions, which can be understood in terms of quadratics. So we’ll gain intuition by studying the problem

\[ f(\boldsymbol{\theta}) = a - \gv^\intercal\boldsymbol{\theta}+ \frac{1}{2}\boldsymbol{\theta}^\intercal\H \boldsymbol{\theta}. \]

Note the similarity to a second–order Taylor expansion.

Let the eigenvalues of \(\H\) be \(\lambda_1 \ge \ldots \ge \lambda_D\), with corresponding eigenvectors \(\boldsymbol{v}_d\). Contrast four cases (checked numerically in the short sketch after this list):

  • \(\lambda_D > 0\): Convex
  • \(\lambda_1 < 0\): Concave
  • \(\lambda_1 > 0 > \lambda_D\): Saddle point
  • \(\lambda_1 > 0 = \lambda_{D-1} = \lambda_D\): Multiple minima
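To make these cases concrete, here is a small numpy check (the matrices are arbitrary illustrative choices, not from the text) that reads off the case from the signs of the eigenvalues:

import numpy as np
# Illustrative 2x2 Hessians, one per case above.
hessians = {
    "convex":          np.diag([2.0, 1.0]),    # all eigenvalues positive
    "concave":         np.diag([-2.0, -1.0]),  # all eigenvalues negative
    "saddle point":    np.diag([2.0, -1.0]),   # mixed signs
    "multiple minima": np.diag([2.0, 0.0]),    # a zero eigenvalue: a flat direction
}
for name, H in hessians.items():
    # eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
    print(f"{name:16s} eigenvalues: {np.linalg.eigvalsh(H)}")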

Gradient descent

If we want to find a minimum of \(f(\boldsymbol{\theta})\), a reasonable strategy is to take downhill steps:

\[ \boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} - \alpha \left. \frac{\partial f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right|_{\boldsymbol{\theta}^{t}} \quad\textrm{for}\quad\alpha > 0. \]

For our quadratic case, this looks like

\[ \frac{\partial f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \H \boldsymbol{\theta}- \gv \quad\Rightarrow\quad \boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} - \alpha (\H \boldsymbol{\theta}^{t} - \gv) = (\boldsymbol{I}- \alpha \H) \boldsymbol{\theta}^{t} + \alpha \gv. \]

If we write

\[ \Delta^{t} = \boldsymbol{\theta}^t - \thetavstar, \]

where \(\left. \frac{\partial f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} \right|_{\thetavstar} = \boldsymbol{0}\), then

\[ \Delta^{t+1} = (\boldsymbol{I}- \alpha \H) \Delta^{t} \quad\Rightarrow\quad \Delta^{t} = (\boldsymbol{I}- \alpha \H)^{t} \Delta^{0}, \]

where

\[ (\boldsymbol{I}- \alpha \H)^{t} = \underbrace{(\boldsymbol{I}- \alpha \H) \ldots (\boldsymbol{I}- \alpha \H)}_{t\rm\ times}. \]

Note that the eigenvalues and eigenvectors of \((\boldsymbol{I}- \alpha \H)\) are inherited from \(\H\):

\[ (\boldsymbol{I}- \alpha \H) \boldsymbol{v}_d = \boldsymbol{v}_d - \alpha \lambda_d \boldsymbol{v}_d = (1 - \alpha \lambda_d) \boldsymbol{v}_d. \]

Define \(\gamma_d := (1 - \alpha \lambda_d)\), and write out \(\Delta^0 = \sum_{d=1}^D a_d \boldsymbol{v}_d\). With random initialization \(a_d \ne 0\) with high probability. Then

\[ \Delta^t= \sum_{d=1}^D a_d \gamma_d^t \boldsymbol{v}_d, \]

whose time evolution is determined entirely by the \(\gamma_d\). Some cases to consider (illustrated numerically in the sketch after this list):

  • Suppose convexity, and \(\alpha < 0\) (gradient ascent)
  • Suppose \(\alpha > 0\) but \(\lambda_D < 0\) (saddle point)
  • Suppose convex but \(\alpha > 2 / \lambda_1\)
  • Suppose convex and \(\alpha = 1 / \lambda_1\). What happens to \(\boldsymbol{v}_1\)? What happens to \(\boldsymbol{v}_D\)?
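The sketch below runs gradient descent on a small quadratic with eigenvalues \(\lambda_1 = 10\) and \(\lambda_2 = 1\) (the particular \(\H\), \(\gv\), and step sizes are illustrative choices, not from the text). Step sizes below \(2 / \lambda_1\) converge, with the slowest component governed by the \(\gamma_d\) closest to one in absolute value; step sizes above \(2 / \lambda_1\) diverge.

import numpy as np
# Quadratic with gradient H @ theta - g, as in the text; the minimizer solves H theta* = g.
H = np.diag([10.0, 1.0])   # lambda_1 = 10, lambda_2 = 1
g = np.array([1.0, 1.0])
theta_star = np.linalg.solve(H, g)
def run_gd(alpha, n_steps=100):
    theta = np.zeros(2)
    for _ in range(n_steps):
        theta = theta - alpha * (H @ theta - g)
    return np.linalg.norm(theta - theta_star)
# The stability threshold is 2 / lambda_1 = 0.2.
for alpha in [0.05, 0.19, 0.21]:
    print(f"alpha = {alpha:4.2f}: error after 100 steps = {run_gd(alpha):.3e}")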

Stochastic gradient descent

Returning to empirical risk minimization, recall that we are minimizing functions of the form

\[ f(\theta) = \frac{1}{N} \sum_{n=1}^N\mathscr{L}(z_n; \theta) \quad\Rightarrow\quad \frac{\partial f(\theta)}{\partial \theta} = \frac{1}{N} \sum_{n=1}^N\frac{\partial \mathscr{L}(z_n; \theta)}{\partial \theta} \]

A single gradient step requires computing \(N\) terms, which can be very costly for large datasets. Furthermore, the whole idea of empirical risk minimization is that we should get a similar answer even if \(N\) takes somewhat different values, which means that any individual datapoint should not have much influence on the result. Surely we don’t need to do order-\(N\) computation at each step? In fact, we don’t: you can do well by randomly subsampling the data, even to the extreme of sampling a single datapoint at each step, as long as you choose the step sizes carefully.

Stochastic gradient descent (SGD) is designed to minimize a function of the form

\[ f(\theta) = \mathbb{E}\left[f(z; \theta)\right] \]

using draws \(g(z; \theta)\) satisfying \(\mathbb{E}\left[g(z; \theta)\right] = \frac{\partial f(\theta)}{\partial \theta}\). Note that it suffices to take \(g(z; \theta) = \partial f(z; \theta) / \partial \theta\) (as long as you can exchange integration and differentiation), but there are many other choices you could make.

Stochastic gradient descent operates by taking steps

\[ \theta^{t+1} = \theta^{t} - \alpha_t\, g(z^t; \theta^{t}), \]

where \(z^t\) is a fresh draw at iteration \(t\).

As we will see shortly, analysis of SGD strongly motivates taking a step size \(\alpha_t\) that depends on the iterate, in contrast to our analysis of deterministic gradient descent.

Finding the sample mean

Imagine we’re trying to minimize the function

\[ f(\theta) = \frac{1}{2} \frac{1}{N} \sum_{n=1}^N(y_n - \theta)^2 \quad\textrm{so}\quad \frac{\partial f(\theta)}{\partial \theta} = -\frac{1}{N} \sum_{n=1}^N(y_n - \theta) \]

This can be viewed as an expectation over the empirical distribution assigning mass \(1/N\) to each datapoint \(y_n\). Given that, we can form a stochastic gradient descent algorithm by

  1. Randomly sampling an index \(n^t\) from \(1,\ldots,N\)
  2. Computing \(g(z_{n^t}; \theta^t) = -(y_{n^t} - \theta^t)\)
  3. Setting \(\alpha_t= 1/t\)
  4. Applying the SGD update.

Let’s suppose that \(N\) is very large, so that we can effectively treat our random samples as distinct, since we’re unlikely to see repeats in the first few iterations. And without loss of generality, let’s assume that \(n^1 = 1\), \(n^2 = 2\), and so on, so that I don’t have to keep writing superscripts. (This is just a random relabeling of the original dataset, under the assumption that there are no repeats.)

What does this algorithm do? Let’s take \(\theta^0 = 0\). Then

\[ \begin{aligned} \theta^1 ={}& \theta^0 + \frac{1}{1}(y_1 - \theta^0) ={} y_1 \\ \theta^2 ={}& \theta^1 + \frac{1}{2}(y_2 - \theta^1) \\={}& y_1 + \frac{1}{2}(y_2 - y_1) \\={}& \frac{1}{2}(y_1 + y_2) \\ \theta^3 ={}& \theta^2 + \frac{1}{3}(y_3 - \theta^2) \\={}& \frac{1}{2}(y_1 + y_2) + \frac{1}{3}\left(y_3 - \frac{1}{2}(y_1 + y_2)\right) \\={}& \frac{1}{3}(y_1 + y_2 + y_3) \\ \end{aligned} \]

In general, if we have \(\theta^t= \frac{1}{t} \sum_{n=1}^ty_n\), then inductively

\[ \begin{aligned} \theta^{t+1} ={}& \theta^t + \frac{1}{t+1} (y_{t+1} - \theta^t) \\={}& \left(1 - \frac{1}{t+1}\right) \theta^t + \frac{1}{t+1} y_{t+1} \\={}& \frac{t}{t+1} \frac{1}{t} \sum_{n=1}^t y_n + \frac{1}{t+1} y_{t+1} \\={}& \frac{1}{t+1} \sum_{n=1}^{t+1} y_n. \end{aligned} \]

Astonishingly, \(\theta^t\) is precisely the same as what we would have computed if we had directly used gradient descent to minimize the objective function

\[ \hat{f}(\theta) = \frac{1}{2} \frac{1}{t} \sum_{n=1}^{t} (y_n - \theta)^2. \]

But if we had to run gradient descent for \(M\) steps, we would have performed \(M \times t\) gradient evaluations, whereas with SGD we reached the same minimizer with only \(t\) gradient evaluations. Intuitively, this is the source of the computational advantage of SGD: you get a comparably good answer, but you touch each datapoint only once.

This simple example also motivates the classic SGD step size schedule of \(\alpha_t= 1 / t\). In general, optimization theory tells us that SGD will converge whenever

\[ \sum_{t=1}^{T} \alpha_t\rightarrow \infty \quad\textrm{and}\quad \sum_{t=1}^{T} \alpha_t^2 \rightarrow C < \infty \quad\textrm{as}\quad T \rightarrow \infty. \]

You can check that \(\alpha_t= 1 / t\) satisfies these conditions.
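The following Python sketch implements this procedure (the data are synthetic and purely illustrative). Sampling indices without replacement matches the no-repeats assumption above, and with \(\alpha_t = 1/t\) the final iterate is exactly the sample mean:

import numpy as np
rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=10_000)  # synthetic data
theta = 0.0
# Visit each datapoint once, in random order (no repeats), with alpha_t = 1 / t.
for t, n in enumerate(rng.permutation(len(y)), start=1):
    stochastic_grad = -(y[n] - theta)
    theta = theta - (1.0 / t) * stochastic_grad
print(theta, y.mean())  # identical up to floating point error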

Minimizing a quadratic with a fixed step size

Recall again our quadratic example, where we showed that

\[ \frac{\partial f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}} = \H \boldsymbol{\theta}- \gv. \]

Suppose we update with the stochastic gradient

\[ g(\boldsymbol{\theta}) = \H \boldsymbol{\theta}- \gv + \varepsilon, \]

for IID mean-zero noise \(\varepsilon\) with \(\mathbb{E}\left[\left\Vert\varepsilon\right\Vert^2\right] = \sigma^2\). Then

\[ \boldsymbol{\theta}^{t+1} = \boldsymbol{\theta}^{t} - \alpha (\H \boldsymbol{\theta}^{t} - \gv + \varepsilon^t) = (\boldsymbol{I}- \alpha \H) \boldsymbol{\theta}^{t} + \alpha \gv - \alpha \varepsilon^t. \]

Again subtracting \(\thetavstar\) from both sides and using \(\H \thetavstar - \gv = \boldsymbol{0}\),

\[ \boldsymbol{\theta}^{t+1} - \thetavstar = (\boldsymbol{I}- \alpha \H) (\boldsymbol{\theta}^{t} - \thetavstar) - \alpha \varepsilon^t. \]

The noise \(\varepsilon^t\) is independent of \(\boldsymbol{\theta}^{t}\) and of all earlier iterates (but not of subsequent ones!), so

\[ \mathbb{E}\left[\left\Vert\boldsymbol{\theta}^{t+1} - \thetavstar\right\Vert^2\right] \le \left\Vert\boldsymbol{I}- \alpha \H\right\Vert_{op}^2 \mathbb{E}\left[\left\Vert\boldsymbol{\theta}^{t} - \thetavstar\right\Vert^2\right] + \alpha^2 \sigma^2. \]

Previously, we were able to recursively expand the error to get an exponentially decaying distance to \(\thetavstar\), but we cannot in this case due to the \(\alpha^2 \sigma^2\) term.
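A short simulation makes the consequence visible (the \(\H\), \(\sigma\), and step sizes here are illustrative choices, not from the reading): with a fixed step size, the squared error stops decaying and hovers at a noise floor controlled by \(\alpha^2 \sigma^2\), so a smaller \(\alpha\) trades slower convergence for a lower floor.

import numpy as np
rng = np.random.default_rng(1)
H = np.diag([10.0, 1.0])
g = np.array([1.0, 1.0])
theta_star = np.linalg.solve(H, g)
sigma = 1.0
def sgd_error_floor(alpha, n_steps=20_000, tail=2_000):
    theta = np.zeros(2)
    sq_errors = []
    for _ in range(n_steps):
        noise = rng.normal(scale=sigma, size=2)
        theta = theta - alpha * (H @ theta - g + noise)
        sq_errors.append(np.sum((theta - theta_star) ** 2))
    # Average over the final iterates to estimate the stationary squared error.
    return np.mean(sq_errors[-tail:])
for alpha in [0.1, 0.01, 0.001]:
    print(f"alpha = {alpha}: estimated squared-error floor {sgd_error_floor(alpha):.5f}")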

Generalization bounds

Theorem 3 of the reading states that, on a convex objective with bounded gradients,

\[ \mathbb{E}\left[f(\boldsymbol{\theta}^t) - f(\thetavstar)\right] \le \frac{C}{\sqrt{t}}. \]

Astonishingly, in the case that we see each sample only once, this reasoning applies to both the empirical risk and the population risk. In this sense, we get a generalization bound for SGD using only properties of the optimization algorithm.

Automatic differentiation

Both gradient descent and SGD require derivatives of your loss function. How do we compute these derivatives? In the simple cases we study in class, we can usually compute them by hand. But for more complex models, including neural networks, complicated maximum likelihood estimators, compositions of several models, and so on, it can be tedious and error-prone to compute the derivative \(\partial \mathscr{L}(\boldsymbol{z}_n, \theta) / \partial \theta\) by hand. Thanks to widespread, high–quality automatic differentiation (or just autodiff), it is now no longer necessary to compute gradients manually.

What automatic differentiation is not: numerical or symbolic differentiation

It’s traditional to begin a lecture on autodiff by saying what it is not, in order to avoid some common misconceptions.

Numerical differentiation

Suppose you have a computer function that computes a squared error loss, \(f(\theta; x) = \frac{1}{2}(x- \theta)^2\):

def sq_loss(theta, x):
    return 0.5 * (x - theta)**2

How would you use the computer to compute a derivative? One way is “numerical differentiation,” in which we evaluate the function at two nearby points and divide the difference in function values by the difference in inputs:

\[ \nabla f(\theta; x) \approx \frac{f(\theta + \epsilon; x) - f(\theta; x)}{\epsilon} \]

def sq_loss_grad(theta, x, epsilon):
    return (sq_loss(theta + epsilon, x) - sq_loss(theta, x)) / epsilon

The problem is that this can be numerically unstable, especially in high dimensions or with highly curved functions. You typically don’t know in advance how small \(\epsilon\) should be: too large and the finite–difference approximation is poor, too small and floating–point roundoff dominates.

Symbolic differentiation

You might say that Python is unable to differentiate sq_loss because it doesn’t have a symbolic representation of the formula. Tools like Wolfram Alpha keep symbolic representations of formulas, which allow them to keep track of what is a variable and what is a number, and to perform mathematical operations the way we normally would.

The problem is that this usually requires a specialized, verbose, and inefficient way of representing formulas. Furthermore, the intermediate expressions can grow quite complex, so differentiating in this way can be inefficient.

What automatic differentiation is: the chain rule and function decomposition

First, note that our loss function is actually the composition of three functions of \(\theta\). (For simplicity, we’ll keep \(x\) fixed.)

\[ \begin{aligned} \phi_1 ={} & f_1(\theta) ={} x - \theta \\ \phi_2 ={}& f_2(\phi_1) ={} \phi_1^2 \\ \phi_3 ={}& f_3(\phi_2) ={} \frac{1}{2} \phi_2 \\ f(\theta; x) ={}& f_3 \circ f_2 \circ f_1(\theta) = f_3(f_2(f_1(\theta))). \end{aligned} \]

And each of these component functions has a simple derivative:

\[ \begin{aligned} \nabla f_1(\theta) ={}& -1 \\ \nabla f_2(\phi_1) ={}& 2 \phi_1 \\ \nabla f_3(\phi_2) ={}& \frac{1}{2} \\ \end{aligned} \]

Finally, the derivative of a composition is the product of the derivatives via the chain rule:

\[ \begin{aligned} \nabla f(\theta; x) ={}& \nabla f_3(f_2(f_1(\theta))) \times \nabla f_2(f_1(\theta)) \times \nabla f_1(\theta) \end{aligned} \]

Note that we need to compute the lower–numbered functions first in order to know at what point to evaluate the higher–numbered derivatives. You could thus compute the function and its gradient in a single operation by computing:

  1. \(\phi_1 = f_1(\theta)\) and \(J_1 = \nabla f_1(\theta) ={} -1\)
  2. \(\phi_2 = f_2(\phi_1)\) and \(J_2 = \nabla f_2(\phi_1) \times J_1 ={} -2 \phi_1 = -2(x- \theta)\)
  3. \(\phi_3 = f_3(\phi_2)\) and \(J_3 = \nabla f_3(\phi_2) \times J_2 ={} \frac{1}{2} (-2) (x- \theta) = \theta - x\)
  4. Return \(f(\theta; x) = \phi_3\) and \(\nabla f(\theta; x) = J_3\).

Note that:

  • This procedure correctly computes \(\nabla f(\theta; x) = \theta - x\).
  • A full symbolic expression for the derivative is never constructed or instantiated (unlike symbolic differentiation)
  • The derivatives are exact up to floating point precision with no tunable parameters (unlike numeric differentiation)

This method of computing derivatives is called “forward mode” differentiation because you trace forward along the computation path, accumulating derivatives as you go. You could also go the other way, first computing \(\phi_1\), \(\phi_2\), and \(\phi_3\), and then accumulating derivatives in reverse order by tracing back along the computation path (both orders are sketched in code after the following list):

  1. \(R_3 = \frac{1}{2}\)
  2. \(R_2 = R_3 \times \nabla f_2(\phi_1) = \frac{1}{2} 2 \phi_1 = x- \theta\)
  3. \(R_1 = R_2 \times \nabla f_1(\theta) = -(x- \theta) = \theta - x\).
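Here is a tiny Python sketch of both accumulation orders for this example (the function and variable names are just illustrative). Forward mode carries \(J_i = \partial\phi_i / \partial\theta\) along with the evaluation; reverse mode first evaluates, then walks back from the output, reusing the stored intermediates:

def sq_loss_with_grads(theta, x):
    # Forward pass: evaluate each component function, keeping the intermediates.
    phi1 = x - theta        # f1
    phi2 = phi1 ** 2        # f2
    phi3 = 0.5 * phi2       # f3
    # Forward-mode accumulation: J_i = d phi_i / d theta.
    J1 = -1.0
    J2 = 2.0 * phi1 * J1
    J3 = 0.5 * J2
    # Reverse-mode accumulation: multiply derivatives backwards from the output,
    # reusing the stored value of phi1.
    R3 = 0.5
    R2 = R3 * 2.0 * phi1
    R1 = R2 * (-1.0)
    return phi3, J3, R1     # value, forward-mode gradient, reverse-mode gradient
print(sq_loss_with_grads(theta=1.0, x=3.0))  # (2.0, -2.0, -2.0); exact gradient is theta - x = -2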

Although this example involves only scalars, in general the cost of forward mode scales with the input dimension, while the cost of reverse mode scales with the output dimension (at the price of storing the intermediate values of the forward pass). Consequently:

  • Forward mode is preferable when differentiating a high–dimensional output with respect to a low–dimensional input
  • Reverse mode is preferable when differentiating a low–dimensional output with respect to a high–dimensional input

For this reason, reverse mode is preferable for neural networks, whose scalar loss depends on a very large number of parameters. In fact, reverse–mode automatic differentiation applied to a neural network loss is exactly what is meant by back–propagation; it is the gradient computation at the heart of stochastic gradient descent training.

Exercise

In Python, using jax, compute the gradient of the quadratic loss function and compare with the exact answer and with numerical differentiation with a range of \(\epsilon\).
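A possible starting point is sketched below; it assumes only jax.grad and ordinary Python arithmetic, and the particular values of \(\theta\), \(x\), and \(\epsilon\) are arbitrary.

import jax
def sq_loss(theta, x):
    return 0.5 * (x - theta) ** 2
# Reverse-mode autodiff with respect to the first argument (theta).
grad_sq_loss = jax.grad(sq_loss)
theta, x = 1.0, 3.0
exact = theta - x  # hand-computed derivative
print("autodiff:", grad_sq_loss(theta, x), " exact:", exact)
# Forward-difference approximations: the error first shrinks with epsilon,
# then grows again as floating-point roundoff takes over.
for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9]:
    numeric = (sq_loss(theta + eps, x) - sq_loss(theta, x)) / eps
    print(f"eps = {eps:.0e}: numeric = {numeric:+.8f}, error = {abs(numeric - exact):.2e}")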

References

Hardt, M., and B. Recht. 2021. “Patterns, Predictions, and Actions: A Story about Machine Learning.” arXiv Preprint arXiv:2102.05242.