VC dimension and uniform laws for zero–one loss
\[ % Generalization error \def\err{\textrm{Error}} \]
Goals
- Give examples of loss functions that satisfy the integrable Lipschitz (IL) conditions
- Show that zero–one loss does not satisfy the integrable Lipschitz conditions
- Define the Vapnik-Chervonenkis (VC) dimension
- State (but do not prove) a ULLN in terms of the VC dimension
Reading
These notes can be supplemented with Hastie, Tibshirani, and Friedman (2009) Chapter 7.9. The final bound is based on Wainwright (2019) Chapter 4.3, which is more advanced but provides a more complete treatment. Finally, if you want to go further, there are nice video tutorials in lectures 7 and 8 of this online course.
Examples of integrable Lipschitz loss
Recall that we proved a ULLN for functions of the form \(\phi(z, \theta)\) satisfying
\[ \left|\phi(z, \theta) - \phi(z, \theta')\right| \le M(z) \left\Vert\theta - \theta'\right\Vert \]
where \(\theta \in \Omega\) and \(\Omega\) is compact. We called these the "integrable Lipschitz" (IL) conditions. This is because a Lipschitz function of \(\theta\) satisfies
\[ \left|g(\theta) - g(\theta')\right| \le L \left\Vert\theta - \theta'\right\Vert \]
for some constant \(L\); the IL condition allows this \(L\) to depend on \(z\), so long as it is bounded by an integrable function \(M(z)\).
To use this result to control generalization error, we want to bound quantities of the form
\[ \sup_{f\in \mathcal{F}}\left|\frac{1}{N} \sum_{n=1}^N\mathscr{L}(f(x_n), y_n) - \mathbb{E}\left[\mathscr{L}(f(x), y)\right]\right|. \]
So we might take \(z= (x, y)\), a parametric learner \(f(\cdot, \theta)\), and
\[ \phi(z, \theta) = \mathscr{L}(f(x, \theta), y). \]
OLS loss
For example, taking squared loss and linear regression, we get
\[ \phi(z, \theta) = (\boldsymbol{x}^\intercal\theta - y)^2 = y^2 - 2 y\boldsymbol{x}^\intercal\theta + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal\theta \theta^\intercal\right). \]
Let us assume that \(\Omega\) is compact. (One can show that \(\Omega\) can be safely taken to be compact with high probability as \(N\) grows, but I will leave that as a potential homework problem.) We then have, by the triangle inequality and Cauchy–Schwarz,
\[ \begin{aligned} \phi(z, \theta) - \phi(z, \theta') ={}& - 2 y\boldsymbol{x}^\intercal(\theta - \theta') + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal(\theta \theta^\intercal- \theta' {\theta'}^\intercal)\right) \\={}& - 2 y\boldsymbol{x}^\intercal(\theta - \theta') + \mathrm{trace}\left(\boldsymbol{x}\boldsymbol{x}^\intercal(\theta + \theta') (\theta - \theta')^\intercal\right) \quad\Rightarrow\\ \left|\phi(z, \theta) - \phi(z, \theta')\right| \le{}& 2 |y|\left\Vert\boldsymbol{x}\right\Vert \left\Vert\theta - \theta'\right\Vert + \left\Vert\boldsymbol{x}\right\Vert^2 \left\Vert\theta + \theta'\right\Vert \left\Vert\theta - \theta'\right\Vert \\\le{}& \left( 2 |y|\left\Vert\boldsymbol{x}\right\Vert + \left\Vert\boldsymbol{x}\right\Vert^2 \left\Vert\theta + \theta'\right\Vert \right) \left\Vert\theta - \theta'\right\Vert \\\le{}& 2 \left( |y|\left\Vert\boldsymbol{x}\right\Vert + \left\Vert\boldsymbol{x}\right\Vert^2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert \right) \left\Vert\theta - \theta'\right\Vert, \end{aligned} \]
where \(\sup_{\theta \in \Omega} \left\Vert\theta\right\Vert\) is finite because \(\Omega\) is compact. (The second equality holds because the cross terms \(\theta{\theta'}^\intercal\) and \(\theta'\theta^\intercal\) contribute equally inside the trace, since \(\boldsymbol{x}\boldsymbol{x}^\intercal\) is symmetric, and so cancel.) We can thus take
\[ M(z) = 2 \left( |y|\left\Vert\boldsymbol{x}\right\Vert + \left\Vert\boldsymbol{x}\right\Vert^2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert \right), \]
and see that squared loss satisfies the IL condition (assuming \(\mathbb{E}\left[|y|\left\Vert\boldsymbol{x}\right\Vert\right]\) and \(\mathbb{E}\left[\left\Vert\boldsymbol{x}\right\Vert^2\right]\) are finite), and so obeys a ULLN. (Though recall that we can show directly that this loss obeys a ULLN, as you have done in your homework.)
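As a quick sanity check, here is a small simulation verifying the bound \(\left|\phi(z, \theta) - \phi(z, \theta')\right| \le M(z) \left\Vert\theta - \theta'\right\Vert\) numerically for the squared loss. This is only a sketch: the dimension, the radius of \(\Omega\), and the sampling distributions below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 3
radius = 5.0  # Take Omega to be the ball of radius 5, so sup_theta ||theta|| = 5.

def phi(x, y, theta):
    """Squared loss for linear regression: phi(z, theta) = (x^T theta - y)^2."""
    return (x @ theta - y) ** 2

def sample_in_ball(r):
    """Draw a point uniformly from the ball of radius r."""
    v = rng.normal(size=dim)
    return r * rng.uniform() ** (1 / dim) * v / np.linalg.norm(v)

worst_ratio = 0.0
for _ in range(10_000):
    x, y = rng.normal(size=dim), rng.normal()
    theta0, theta1 = sample_in_ball(radius), sample_in_ball(radius)
    lhs = abs(phi(x, y, theta1) - phi(x, y, theta0))
    M_z = 2 * (abs(y) * np.linalg.norm(x) + np.linalg.norm(x) ** 2 * radius)
    rhs = M_z * np.linalg.norm(theta1 - theta0)
    worst_ratio = max(worst_ratio, lhs / rhs)

# If the bound is correct, this ratio never exceeds 1.
print(f"Largest observed |phi - phi'| / (M(z) ||theta - theta'||): {worst_ratio:.3f}")
```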
Bounded derivatives
In general, it suffices to show that the norm of the derivative \(\frac{\partial \phi(z; \theta)}{\partial \theta}\) is bounded, uniformly over \(\theta \in \Omega\), by an integrable function \(M(z)\). This is because, for any \(\theta_1\) and \(\theta_0\), we can define a path \(\theta(t) = \theta_0 + (\theta_1 - \theta_0) t\) and write
\[ \begin{aligned} \phi(z; \theta_1) ={}& \phi(z; \theta_0) + \int_{0}^1 \frac{\partial \phi(z; \theta(t))}{\partial t} d t \\={}& \phi(z; \theta_0) + \int_{0}^1 \left. \frac{\partial \phi(z; \theta)}{\partial \theta} \right|_{\theta(t)} (\theta_1 - \theta_0) d t, \end{aligned} \]
giving
\[ \begin{aligned} \left|\phi(z; \theta_1) - \phi(z; \theta_0)\right| \le{}& \int_{0}^1 \left|\left. \frac{\partial \phi(z; \theta)}{\partial \theta} \right|_{\theta(t)} (\theta_1 - \theta_0)\right| d t \\\le{}& \int_{0}^1 \left\Vert\left. \frac{\partial \phi(z; \theta)}{\partial \theta} \right|_{\theta(t)}\right\Vert d t\, \left\Vert\theta_1 - \theta_0\right\Vert \\\le{}& \sup_{\theta \in \Omega} \left\Vert\left. \frac{\partial \phi(z; \theta)}{\partial \theta} \right|_{\theta}\right\Vert \left\Vert\theta_1 - \theta_0\right\Vert. \end{aligned} \]
Therefore, if we take
\[ M(z) = \sup_{\theta \in \Omega} \left\Vert\left. \frac{\partial \phi(z; \theta)}{\partial \theta} \right|_{\theta}\right\Vert, \]
and show that \(\mathbb{E}\left[M(z)\right] < \infty\), then the IL property follows.
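For instance, applying this to the squared-loss example above, the gradient with respect to \(\theta\) is \(2\,(\boldsymbol{x}^\intercal\theta - y)\,\boldsymbol{x}^\intercal\), and this route recovers the same \(M(z)\) as the direct computation:
\[ \sup_{\theta \in \Omega} \left\Vert \frac{\partial \phi(z; \theta)}{\partial \theta} \right\Vert = \sup_{\theta \in \Omega} 2 \left|\boldsymbol{x}^\intercal\theta - y\right| \left\Vert\boldsymbol{x}\right\Vert \le 2 \left( |y|\left\Vert\boldsymbol{x}\right\Vert + \left\Vert\boldsymbol{x}\right\Vert^2 \sup_{\theta \in \Omega} \left\Vert\theta\right\Vert \right) = M(z). \]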
Zero–one loss functions
Unfortunately, zero–one losses are typically not IL, since the loss depends discontinuously on the parameter \(\theta\) through the classification function \(f\). For a simple example, take \(x\in \mathbb{R}\), \(y\in \{0,1\}\) (so it's a classification problem), and \(\mathscr{L}(\hat{y}, y) = \mathrm{I}\left(\hat{y}\ne y\right)\). Consider the simple threshold classifier
\[ f(x; \theta) = \mathrm{I}\left(x\ge \theta\right). \]
Now consider the classification error at \(\theta = x\pm \varepsilon\) for small \(\varepsilon > 0\) and \(y= 1\). We have:
\[ \mathscr{L}(f(x; x+ \varepsilon), 1) - \mathscr{L}(f(x; x- \varepsilon), 1) = 1 \quad\textrm{but}\quad \left\Vert x+ \varepsilon - (x- \varepsilon)\right\Vert = 2 \varepsilon. \]
Since \(\mathscr{L}(f(x; \theta), 1)\) is discontinuous in \(\theta\) at \(x\), the function cannot be IL. A different technique is needed to control generalization error for zero–one loss.
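To see this failure concretely, here is a tiny numerical illustration (the particular point \(x = 0\) and the values of \(\varepsilon\) are arbitrary): the loss difference stays at \(1\) no matter how small \(\varepsilon\) is, so no finite \(M(z)\) can make the IL bound hold.

```python
def f(x, theta):
    """Threshold classifier f(x; theta) = I(x >= theta)."""
    return float(x >= theta)

def zero_one_loss(y_hat, y):
    """Zero-one loss I(y_hat != y)."""
    return float(y_hat != y)

x, y = 0.0, 1
for eps in [1.0, 0.1, 0.01, 0.001]:
    jump = zero_one_loss(f(x, x + eps), y) - zero_one_loss(f(x, x - eps), y)
    # The loss difference is always 1 while ||theta - theta'|| = 2 * eps shrinks,
    # so the ratio |loss difference| / ||theta - theta'|| blows up as eps -> 0.
    print(f"eps = {eps:6.3f}: loss difference = {jump:.0f}, "
          f"parameter difference = {2 * eps:.3f}")
```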
VC dimension
\[ \def\xvec{\mathcal{X}} \]
A key tool in controlling generalization error for functions taking values in \(\{0,1\}\) is the "Vapnik–Chervonenkis" (VC) dimension. The VC dimension is a measure of the complexity of a set of functions, so we might write \(VC(\mathcal{F})\) for the VC dimension of the function class \(\mathcal{F}\), where each member \(f\in \mathcal{F}\) maps inputs \(x\) in some domain to binary values \(\{0, 1\}\). (The VC dimension can be defined for more general functions by reducing them to binary–valued functions; see Van der Vaart (2000) Chapter 19, for example.)
For a given set of \(N\) points, \(\xvec = (x_1, \ldots, x_N)\), a function \(f\) returns a binary vector \(f(\xvec) := (f(x_1), \ldots, f(x_N))\). As \(f\) ranges over \(\mathcal{F}\), \(f(\xvec)\) ranges over some subset of the \(2^N\) possible binary vectors.
For example, take our threshold classifier from above. If \(N=2\) and \(\xvec = (3, 4)\), then as \(\theta\) ranges from \(-\infty\) to \(\infty\), \(f(\xvec)\) goes from \((1,1)\) to \((0, 1)\) (as \(\theta\) passes \(3\)) to \((0, 0)\) (as \(\theta\) passes \(4\)). The value \((1, 0)\) is impossible to produce with any \(f\in \mathcal{F}\) and this \(\xvec\), since \(3 < 4\), and so \(4\) is classified as \(1\) whenever \(3\) is.
As we saw, for a given \(\xvec\), the number of distinct values of \(f(\xvec)\) may be smaller than the total number of possible values, \(2^N\). If \(f(\xvec)\) takes all \(2^N\) possible values as \(f\) ranges over \(\mathcal{F}\), then \(\mathcal{F}\) is said to "shatter" \(\xvec\). Intuitively, \(\xvec\) gets harder to shatter as it contains more and more points. For example, the threshold function above can shatter a single point, but cannot shatter two points. Also intuitively, more expressive function classes can shatter larger and larger sets of points.
For a given function class \(\mathcal{F}\) of functions taking values in \(\{0,1\}\), let \(VC(\mathcal{F})\) be the largest number \(N\) such that there exists a vector \(\xvec\) of length \(N\) that can be shattered by \(\mathcal{F}\). Equivalently, if \(M\) is the smallest number such that no \(\xvec\) of length \(M\) can be shattered by \(\mathcal{F}\), then \(VC(\mathcal{F}) = M - 1\).
Note that the VC dimension doesn't mean you can shatter all sets of size \(VC(\mathcal{F})\). In fact, this is obviously impossible, since a set consisting of \(N\) copies of the same value cannot be shattered, no matter how complex \(\mathcal{F}\) is. The VC dimension requires only that some set \(\xvec\) of size \(VC(\mathcal{F})\) can be shattered, and that no set of size \(VC(\mathcal{F}) + 1\) can be.
We have already shown that the VC dimension of the threshold classifier is \(1\).
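Here is a short brute-force check of that claim. The particular points are just for illustration; the key observation is that, for threshold classifiers, it suffices to try one candidate threshold below all the points, one between each consecutive pair, and one above all the points.

```python
from itertools import product

def achievable_labelings(points):
    """All label vectors (I(x >= theta) for x in points) achievable by some theta."""
    pts = sorted(points)
    candidates = ([pts[0] - 1.0]
                  + [(a + b) / 2 for a, b in zip(pts, pts[1:])]
                  + [pts[-1] + 1.0])
    return {tuple(int(x >= t) for x in points) for t in candidates}

def is_shattered(points):
    """True if threshold classifiers produce all 2^N labelings of these points."""
    return achievable_labelings(points) == set(product([0, 1], repeat=len(points)))

print(achievable_labelings([3.0, 4.0]))  # three labelings; (1, 0) is missing
print(is_shattered([3.0]))               # True: a single point can be shattered
print(is_shattered([3.0, 4.0]))          # False: two points cannot, so VC dimension is 1
```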
The passage from VC dimension to ULLNs is somewhat involved, and beyond the scope of this (undergraduate) course. An end–to–end treatment can be found in Wainwright (2019) Chapter 4, from which we simply state the result.
Let \(VC(\mathcal{F}) = \nu\), and let each \(f\in \mathcal{F}\) take values in \(\{0,1\}\). Then, by Wainwright (2019), for any \(\delta > 0\),
\[ \sup_{f\in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^Nf(z_n) - \mathbb{E}\left[f(z)\right] \right| \le \sqrt{\frac{\nu \log (N+1)}{N}} + \delta \quad\textrm{with probability at least}\quad 1 - \exp\left(-N \delta^2 / 2 \right). \]
Specifically, this is a consequence of the following results from Wainwright (2019):
- Proposition 4.18 and Definition 4.13 (VC dimension implies polynomial discrimination)
- Lemma 4.14 and equation 4.24 (polynomial discrimination controls Rademacher complexity)
- Theorem 4.10 (Rademacher complexity controls the ULLN)
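To get a feel for the bound, here is a small simulation comparing it to the actual uniform deviation for the threshold classifiers \(f(x; \theta) = \mathrm{I}\left(x \ge \theta\right)\), whose VC dimension is \(1\). This is only a sketch: the sample size \(N\), the value of \(\delta\), the number of replications, and the standard normal distribution for \(x\) are arbitrary choices.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
nu, N, delta = 1, 500, 0.1  # VC dimension of threshold classifiers is 1.

bound = np.sqrt(nu * np.log(N + 1) / N) + delta
fail_prob = np.exp(-N * delta ** 2 / 2)

def uniform_deviation(x_sorted):
    """sup over theta of |empirical mean of I(x >= theta) - P(x >= theta)| for x ~ N(0,1).

    The empirical mean is a step function of theta, so the supremum is attained
    at a sample point or just above one; check both for each sorted point.
    """
    n = len(x_sorted)
    i = np.arange(1, n + 1)
    true_tail = 1.0 - norm.cdf(x_sorted)            # P(x >= theta) at theta = x_(i)
    dev_at = np.abs((n - i + 1) / n - true_tail)    # empirical tail at theta = x_(i)
    dev_above = np.abs((n - i) / n - true_tail)     # empirical tail just above x_(i)
    return max(dev_at.max(), dev_above.max())

deviations = [uniform_deviation(np.sort(rng.normal(size=N))) for _ in range(200)]

print(f"bound: {bound:.3f} (holds with probability at least {1 - fail_prob:.3f})")
print(f"largest of 200 simulated uniform deviations: {max(deviations):.3f}")
```

In simulations like this the bound typically holds with plenty of room to spare; VC-based bounds are valid for any distribution of the data, but are often conservative for any particular one.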
Note that this result is stronger than our asymptotic results, in the sense that it is an exact finite–sample bound rather than an (otherwise uncontrolled) guarantee as \(N \rightarrow \infty\). Finite–sample bounds are much easier to obtain when \(f(x)\) is bounded as a function of \(x\), as is the case for zero–one loss but not typically for regression losses.
Exercise: Show that if a function class \(\mathcal{F}\) of classifiers \(f(\boldsymbol{x})\) has finite VC dimension, then the function class \(\{ \phi(\boldsymbol{x}, y) = \mathrm{I}\left(f(\boldsymbol{x}) \ne y\right) \textrm{ for }f\in \mathcal{F}\}\) also has finite VC dimension. Conclude that a ULLN holds for zero–one loss when using a classifier with finite VC dimension.