Generalization error and uniform laws
Goals
- Introduce generalization error, a complexity cost for generic loss functions
- Express complexity as the expressiveness of a function class
- Present uniform bounds as one (conservative) way to control complexity cost
A generic complexity cost
For OLS estimators and least squares loss, we’ve derived a complexity penalty that takes the form of a parametric variance. Specifically, we showed that the loss incurred by using \(\hat{\boldsymbol{\beta}}\) for prediction, \(\mathscr{L}(\hat{\boldsymbol{\beta}}^\intercal\boldsymbol{x}, y)\), can be decomposed into a bias term that improves with complexity, an irreducible loss that cannot be improved, and a variance term:
\[ \mathrm{trace}\left(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] \mathrm{Cov}\left(\hat{\boldsymbol{\beta}}\right)\right) \quad\textrm{(Variance term)}\quad \]
which has the potential to increase with the complexity of the model. In a certain sense, if we believe in the IID assumption, then this variance term is the only reason to not increase the complexity of your regression without bound.
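To make this concrete, here is a minimal simulation sketch, assuming a standard Gaussian design with identity covariance (so \(\mathbb{E}\left[\boldsymbol{x}\boldsymbol{x}^\intercal\right] = I\)); under that assumption, the variance term works out to roughly \(\sigma^2 P / (N - P - 1)\):

```python
# A minimal sketch (Gaussian design with identity covariance is an assumption):
# estimate trace(E[x x^T] Cov(beta_hat)) by refitting OLS on repeated samples.
import numpy as np

rng = np.random.default_rng(0)
N, P, sigma = 50, 5, 1.0           # sample size, dimension, noise scale
beta_true = rng.normal(size=P)

reps = 2000
beta_hats = np.empty((reps, P))
for r in range(reps):
    X = rng.normal(size=(N, P))                      # E[x x^T] = I
    y = X @ beta_true + sigma * rng.normal(size=N)
    beta_hats[r] = np.linalg.lstsq(X, y, rcond=None)[0]

# With E[x x^T] = I, the variance term is just trace(Cov(beta_hat)).
var_term = np.trace(np.cov(beta_hats, rowvar=False))
print("simulated variance term:", var_term)
print("theory, sigma^2 P / (N - P - 1):", sigma**2 * P / (N - P - 1))
```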
However, these arguments break down if we consider alternative loss functions, such as the absolute loss \(\mathscr{L}(\hat{y}, y) = \left|\hat{y}- y\right|\) or a classification loss like \(\mathscr{L}(\hat{y}, y) = \mathrm{I}\left(\hat{y}< c \textrm{ and }y> c\right)\), since we cannot decompose these losses into bias and variance contributions.
Furthermore, unlike for OLS, for these losses we don’t have explicit forms for the empirical loss-minimizing parameters. At a deeper level, it seems wrong to express the complexity of our functions in terms of a parameter at all. Shouldn’t the complexity be measured in terms of the class of prediction rules themselves, rather than their representation?
To this end, let’s consider a more generic decomposition of loss that allows us to go beyond least squares and OLS. Recall our three types of functions:
\[ \begin{aligned} f^{\star}={}& \underset{f}{\mathrm{argmin}}\, \mathscr{L}(f) \\ \overline{f}={}& \underset{f\in \mathcal{F}}{\mathrm{argmin}}\, \mathscr{L}(f)\\ \hat{f}={}& \underset{f\in \mathcal{F}}{\mathrm{argmin}}\, \hat{\mathscr{L}}(f)\\ \textrm{where}\\ \mathscr{L}(f) ={}& \mathbb{E}_{new}\left[\mathscr{L}(f(\boldsymbol{x}_\mathrm{new}), y_\mathrm{new})\right] \\ \hat{\mathscr{L}}(f) ={}& \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(\boldsymbol{x}_n), y_n) \\ \end{aligned} \]
Here, \(f^{\star}\) is the best we could do, \(\overline{f}\) is the best we could do in our class of functions \(\mathcal{F}\), and \(\hat{f}\) is the computable estimate based on our data. We want to quantify the complexity cost of \(\mathcal{F}\), since by making \(\mathcal{F}\) more expressive, we can only make \(\mathscr{L}(\overline{f})\) closer to \(\mathscr{L}(f^{\star})\).
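For losses like the absolute loss, \(\hat{f}\) can still be computed numerically even though no closed form exists. Below is a minimal sketch, assuming a linear class \(\mathcal{F} = \{\boldsymbol{x}\mapsto \boldsymbol{b}^\intercal\boldsymbol{x}\}\) and simulated heavy-tailed data (both illustrative choices), that finds the empirical minimizer with a generic derivative-free optimizer:

```python
# A hedged sketch: find f_hat = argmin of the empirical absolute loss over
# a linear class, numerically, since no closed form is available.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
N, P = 200, 3
X = rng.normal(size=(N, P))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.standard_t(df=3, size=N)

def empirical_loss(b):
    """L_hat(f) = (1/N) sum_n |f(x_n) - y_n| for f(x) = b^T x."""
    return np.mean(np.abs(X @ b - y))

# Nelder-Mead handles the nonsmooth absolute loss; the OLS formula would not apply.
f_hat = minimize(empirical_loss, x0=np.zeros(P), method="Nelder-Mead")
print("empirical absolute-loss minimizer:", f_hat.x)
```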
In loose analogy with the bias–variance decomposition, we can add and subtract to express the cost of using \(\hat{f}\) as the sum of three terms:
\[ \begin{aligned} \mathscr{L}(\hat{f}) ={}& \underset{\textrm{Sampling error}}{\mathscr{L}(\hat{f}) - \mathscr{L}(\overline{f})} + \underset{\textrm{Approximation error}}{\mathscr{L}(\overline{f}) - \mathscr{L}(f^{\star})} + \underset{\textrm{Irreducible error}}{\mathscr{L}(f^{\star})}. \end{aligned} \]
The approximation error is like “bias,” and only improves with complexity. The irreducible error cannot be improved, no matter how expressive \(\mathcal{F}\) is. If there is a penalty to complexity, it can only come from the “sampling error,” the cost of using the empirical loss as a substitute for true loss. This was the “variance” term in OLS.
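To see the three terms side by side, here is a hedged Monte Carlo sketch. It assumes squared loss, one-dimensional inputs, a quadratic \(f^{\star}\), and \(\mathcal{F}\) = linear functions, and approximates \(\overline{f}\) by fitting on a very large sample:

```python
# An illustrative sketch (squared loss, quadratic truth, and linear class F
# are all assumptions) estimating the three-term decomposition by simulation.
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.5
f_star = lambda x: x**2                    # the best possible prediction rule

def true_loss(predict, n_mc=200_000):
    """Monte Carlo estimate of L(f) = E[(f(x_new) - y_new)^2]."""
    x = rng.uniform(-1, 1, size=n_mc)
    y = f_star(x) + sigma * rng.normal(size=n_mc)
    return np.mean((predict(x) - y) ** 2)

def fit_linear(x, y):
    """Minimize empirical squared loss over F = {x -> a + b x}."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return lambda x_new: intercept + slope * x_new

# f_bar, approximated by fitting on a very large sample.
x_big = rng.uniform(-1, 1, size=500_000)
f_bar = fit_linear(x_big, f_star(x_big) + sigma * rng.normal(size=x_big.size))

# f_hat, fit on a small dataset.
N = 30
x = rng.uniform(-1, 1, size=N)
f_hat = fit_linear(x, f_star(x) + sigma * rng.normal(size=N))

irreducible = true_loss(f_star)                 # roughly sigma^2
approximation = true_loss(f_bar) - irreducible  # cost of restricting F to lines
sampling = true_loss(f_hat) - true_loss(f_bar)  # nonnegative, up to MC error
print(irreducible, approximation, sampling)
```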
We can further decompose the sampling error into three parts, only one of which can possibly be problematic, at least for large \(N\): \[ \begin{aligned} \mathscr{L}(\hat{f}) - \mathscr{L}(\overline{f}) ={}& \underset{\textrm{Generalization error}}{\mathscr{L}(\hat{f}) - \hat{\mathscr{L}}(\hat{f})} + \underset{\le 0}{\hat{\mathscr{L}}(\hat{f}) - \hat{\mathscr{L}}(\overline{f})} + \underset{\rightarrow 0 \textrm{ by the LLN}}{\hat{\mathscr{L}}(\overline{f}) - \mathscr{L}(\overline{f})} \ge 0. \end{aligned} \]
Prove that \(\mathscr{L}(\hat{f}) - \mathscr{L}(\overline{f}) \ge 0\) and \(\hat{\mathscr{L}}(\hat{f}) - \hat{\mathscr{L}}(\overline{f}) \le 0\).
The reason the final term goes to zero by the LLN is that \(\overline{f}\) is fixed and does not depend on the data. In contrast, we choose \(\hat{f}\) to minimize \(\hat{\mathscr{L}}(\cdot)\), and so might expect that \(\hat{\mathscr{L}}(\hat{f})\) is an unduly optimistic estimate of our true loss, namely that \(\hat{\mathscr{L}}(\hat{f}) \le \mathscr{L}(\hat{f})\). The difference \(\mathscr{L}(\hat{f}) - \hat{\mathscr{L}}(\hat{f})\) is sometimes called the “generalization error,” because the loss on your observed data may not “generalize” to future, unseen data. The above decomposition shows that a complexity cost must, asymptotically, come from generalization error.
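A quick simulation makes this optimism visible. The sketch below (OLS with squared loss and a Gaussian design, all assumed for illustration) shows that, on average, \(\hat{\mathscr{L}}(\hat{f})\) sits below \(\mathscr{L}(\hat{f})\):

```python
# A small sketch of optimism (setup is illustrative): across replications,
# the empirical loss of the fitted rule underestimates its true loss.
import numpy as np

rng = np.random.default_rng(3)
N, P, sigma, reps = 20, 5, 1.0, 2000
beta_true = np.ones(P)

train_losses, true_losses = [], []
for _ in range(reps):
    X = rng.normal(size=(N, P))
    y = X @ beta_true + sigma * rng.normal(size=N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    train_losses.append(np.mean((X @ beta_hat - y) ** 2))   # L_hat(f_hat)
    X_new = rng.normal(size=(50_000, P))                    # fresh draws
    y_new = X_new @ beta_true + sigma * rng.normal(size=50_000)
    true_losses.append(np.mean((X_new @ beta_hat - y_new) ** 2))  # ~ L(f_hat)

print("mean L_hat(f_hat):", np.mean(train_losses))  # roughly sigma^2 (N - P) / N
print("mean L(f_hat):    ", np.mean(true_losses))   # roughly sigma^2 (1 + P / (N - P - 1))
```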
Uniform laws of large numbers
When discussing the bias–variance tradeoff, we argued that complexity may, but does not necessarily, increase variance. We also argued that if the variance is controlled (e.g. by not having too many parameters), then the risk is controlled as well. Similarly, we will now focus on situations in which we can control the generalization error, while acknowledging that this control may be too strict, and that complex functions may generalize despite failing these controls.
Specifically, the generalization error is small if
\[ \begin{aligned} \left|\mathscr{L}(\hat{f}) - \hat{\mathscr{L}}(\hat{f})\right| = \left| \frac{1}{N} \sum_{n=1}^N\mathscr{L}(\hat{f}(\boldsymbol{x}_n), y_n) - \mathbb{E}_{new}\left[\mathscr{L}(\hat{f}(\boldsymbol{x}_\mathrm{new}), y_\mathrm{new})\right] \right| \rightarrow 0 \quad\textrm{(we want this)}\quad \end{aligned} \]
The problem is that \(\hat{f}\) depends on the data, perhaps in a very complicated way, and so the observations \(\mathscr{L}(\hat{f}(\boldsymbol{x}_n), y_n)\) are not independent: they all depend on one another through the dependence of \(\hat{f}\) on the full dataset. There are roughly two routes to controlling this dependence, and they correspond roughly to two broad ways of controlling generalization error in machine learning:
- By showing that the procedure that generated \(\hat{f}\) cannot depend too strongly on the data, or
- By showing that the loss function cannot depend too much on what \(\hat{f}\) is.
The former requires articulating something about how we get \(\hat{f}\), and so typically requires saying something about a particular algorithm in a particular context. The latter is more general, but may be looser and therefore too conservative. We will consider the latter now, and give some examples of the former later in the course when discussing particular algorithms, like the perceptron algorithm and stochastic gradient descent.
Specifically, we will show the stronger statement that
\[ \begin{aligned} \left|\hat{\mathscr{L}}(\hat{f}) - \mathscr{L}(\hat{f})\right| \le{}& \sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \\={}& \sup_{f\in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(\boldsymbol{x}_n), y_n) - \mathbb{E}_{new}\left[\mathscr{L}(f(\boldsymbol{x}_\mathrm{new}), y_\mathrm{new})\right] \right|. \quad\textrm{(a "uniform" law of large numbers)}\quad \end{aligned} \]
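As an illustration, the sketch below takes a deliberately simple finite class, threshold rules \(f_c(x) = \mathrm{I}(x \ge c)\) over a grid of \(c\) with zero-one loss (assumptions chosen so that the supremum is just a maximum), and checks that the uniform deviation shrinks as \(N\) grows:

```python
# A hedged sketch of a uniform LLN over a finite class of threshold rules.
# The data-generating process (x ~ N(0,1), y = I(x >= 0)) is an assumption.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(4)
cs = np.linspace(-2.0, 2.0, 41)    # the finite function class, indexed by c

def true_loss(c):
    # f_c disagrees with y = I(x >= 0) exactly when x falls between 0 and c.
    return abs(norm.cdf(c) - norm.cdf(0.0))

for N in (50, 500, 5000):
    x = rng.normal(size=N)
    y = x >= 0.0
    sup_dev = max(abs(np.mean((x >= c) != y) - true_loss(c)) for c in cs)
    print(N, sup_dev)              # the uniform deviation shrinks as N grows
```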
Note that this is an asymptotic statement, in which the function class \(\mathcal{F}\) is held fixed as \(N \rightarrow \infty\). Most modern ML actually focuses on finite-sample results that look something like
\[ \mathbb{P}\left(\sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \ge \varepsilon\right) \le C(N, \mathcal{F}, \varepsilon), \]
for some explicit function \(C(\cdot)\). Typically one would hope that, all else fixed, \(C \rightarrow 0\) as \(N \rightarrow \infty\). Furthermore, one usually has \(C \rightarrow 1\) (or larger, making the bound vacuous) as \(\varepsilon \rightarrow 0\), so you cannot control arbitrarily small deviations for any finite \(N\). The dependence on \(\mathcal{F}\) is somewhat explicit, so you can imagine changing the complexity of your function class as the dataset grows. These results are more realistic but a little more complicated, and so beyond the scope of a course like this.
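For intuition about what \(C(N, \mathcal{F}, \varepsilon)\) can look like, consider perhaps the simplest case: if \(\mathcal{F}\) is finite and the loss is bounded in \([0, 1]\), then Hoeffding’s inequality together with a union bound over \(\mathcal{F}\) gives
\[ \mathbb{P}\left(\sup_{f\in \mathcal{F}} \left|\hat{\mathscr{L}}(f) - \mathscr{L}(f)\right| \ge \varepsilon\right) \le 2 \left|\mathcal{F}\right| \exp\left(-2 N \varepsilon^2\right), \]
which indeed goes to zero as \(N \rightarrow \infty\), becomes vacuous as \(\varepsilon \rightarrow 0\), and depends explicitly on the size of \(\mathcal{F}\).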
However, in at least some cases, the asymptotic results are simpler to understand, and they rely on a similar style of argument to the finite-sample results. So the asymptotic results can give some intuition for the more sophisticated results that you might encounter in ML research or in further study.