VC dimension and uniform laws for zero–one loss

Author

Ryan Giordano


Goals

  • Give an example of the integrable Lipschitz conditions
  • Show that zero–one loss does not satisfy the integrable Lipschitz conditions
  • Define the Vapnik-Chervonenkis (VC) dimension
  • State (but do not prove) a ULLN in terms of the VC dimension

Reading

These notes can be supplemented with Hastie, Tibshirani, and Friedman (2009) Chapter 7.9. The final bound is based on Wainwright (2019) Chapter 4.3, which is more advanced but provides a more complete treatment. Finally, if you want to go further, there are nice video tutorials in lectures 7 and 8 of this online course.

Zero–one loss functions

Unfortunately, zero–one losses are typically not integrable Lipschitz (IL), since they depend non–smoothly on the classification function \(f\). For a simple example, take \(x\in \mathbb{R}\), \(y\in \{0,1\}\) (so it’s a classification problem), and \(\mathscr{L}(\hat{y}, y) = \mathrm{I}\left(\hat{y}\ne y\right)\). Consider the simple threshold classifier

\[ f(x; \theta) = \mathrm{I}\left(x\ge \theta\right). \]

Then consider the classification error at \(\theta = x\pm \varepsilon\) for small \(\varepsilon > 0\) and \(y= 1\). We have:

\[ \mathscr{L}(f(x; x+ \varepsilon), 1) - \mathscr{L}(f(x; x- \varepsilon), 1) = 1 \quad\textrm{but}\quad \left\Vert x+ \varepsilon - (x- \varepsilon)\right\Vert = 2 \varepsilon. \]

Since \(\mathscr{L}(f(x; \theta), 1)\) is discontinuous in \(\theta\) at \(x\), the function cannot be IL. A different technique is needed to control generalization error for zero–one loss.
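To make the discontinuity concrete, here is a minimal Python sketch of the computation above (the names `threshold_classifier` and `zero_one_loss` are my own, not from the notes). The loss gap between \(\theta = x + \varepsilon\) and \(\theta = x - \varepsilon\) stays at \(1\) while the parameter gap \(2\varepsilon\) shrinks, so the ratio that a Lipschitz constant would need to bound blows up.

```python
def threshold_classifier(x, theta):
    """The classifier f(x; theta) = I(x >= theta)."""
    return int(x >= theta)

def zero_one_loss(y_hat, y):
    """The zero-one loss L(y_hat, y) = I(y_hat != y)."""
    return int(y_hat != y)

x, y = 0.5, 1
for eps in [1e-1, 1e-3, 1e-6]:
    loss_above = zero_one_loss(threshold_classifier(x, x + eps), y)  # theta = x + eps => f(x) = 0
    loss_below = zero_one_loss(threshold_classifier(x, x - eps), y)  # theta = x - eps => f(x) = 1
    loss_gap = abs(loss_above - loss_below)   # always 1
    theta_gap = 2 * eps                       # shrinks to zero
    print(f"eps = {eps:.0e}: loss gap = {loss_gap}, theta gap = {theta_gap:.1e}, "
          f"ratio = {loss_gap / theta_gap:.1e}")
```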

VC dimension

\[ \def\xvec{\mathcal{X}} \]

A key tool in controlling generalization error for functions taking values in \(\{0,1\}\) is the Vapnik–Chervonenkis (VC) dimension. The VC dimension is a measure of the complexity of a set of functions, so we write \(VC(\mathcal{F})\) for the VC dimension of the function class \(\mathcal{F}\), where each member \(f\in \mathcal{F}\) maps points \(x\) in some domain to the binary values \(\{0, 1\}\). (The VC dimension can be defined for more general functions by reducing them to binary–valued functions; see Van der Vaart (2000) Chapter 19, for example.)

For a given set of \(N\) points, \(\xvec = (x_1, \ldots, x_N)\), a function \(f\) returns a binary vector \(f(\xvec) := (f(x_1), \ldots, f(x_N))\). As \(f\) ranges over \(\mathcal{F}\), \(f(\xvec)\) ranges over some subset of the \(2^N\) possible binary vectors.

For example, take our threshold classifier from above. If \(N=2\) and \(\xvec = (3, 4)\), then as \(\theta\) ranges from \(-\infty\) to \(\infty\), \(f(\xvec)\) goes from \((1,1)\) to \((0, 1)\) (as \(\theta\) passes \(3\)) to \((0, 0)\) (as \(\theta\) passes \(4\)). The value \((1, 0)\) is impossible to produce with any \(f\in \mathcal{F}\) and this \(\xvec\), since \(3 < 4\), and so \(4\) is classified as \(1\) whenever \(3\) is.

As we saw, for a given \(\xvec\), the number of distinct values of \(f(\xvec)\) may be smaller than the total number of possible values, \(2^N\). If \(f(\xvec)\) takes all \(2^N\) possible values as \(f\) ranges over \(\mathcal{F}\), then \(\mathcal{F}\) is said to “shatter” \(\xvec\). Intuitively, \(\xvec\) gets harder to shatter as it contains more and more points. For example, the threshold classifier above can shatter a single point, but cannot shatter any two points. Also intuitively, more expressive function classes can shatter larger and larger sets of points.
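A brute–force check makes this concrete. The sketch below (the helpers `label_pattern` and `patterns` are my own) sweeps \(\theta\) over a fine grid and collects the distinct label vectors: a single point yields both possible patterns, while the pair \((3, 4)\) never produces \((1, 0)\) and so cannot be shattered.

```python
from itertools import product

import numpy as np

def label_pattern(points, theta):
    """The labels (f(x_1; theta), ..., f(x_N; theta)) for f(x; theta) = I(x >= theta)."""
    return tuple(int(x >= theta) for x in points)

def patterns(points, thetas):
    """Distinct label vectors produced as theta sweeps over the grid."""
    return {label_pattern(points, theta) for theta in thetas}

thetas = np.linspace(-10, 10, 2001)  # a fine grid standing in for theta ranging over R

print(patterns((3,), thetas))     # {(0,), (1,)}: a single point is shattered
print(patterns((3, 4), thetas))   # {(1, 1), (0, 1), (0, 0)}: (1, 0) never appears

# Shattering requires all 2^N binary vectors; for the pair (3, 4) this fails.
print(patterns((3, 4), thetas) == set(product((0, 1), repeat=2)))  # False
```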

VC Dimension

For a given function class \(\mathcal{F}\) of functions taking values in \(\{0,1\}\), let \(VC(\mathcal{F})\) be the largest number \(N\) such that there exists a vector \(\xvec\) of length \(N\) that can be shattered by \(\mathcal{F}\). Equivalently, if \(M\) is the smallest number such that no \(\xvec\) of length \(M\) can be shattered by \(\mathcal{F}\), then \(VC(\mathcal{F}) = M - 1\).
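In symbols, using the notation above, the definition can be written as

\[ VC(\mathcal{F}) = \max\left\{ N : \textrm{there exists } \xvec = (x_1, \ldots, x_N) \textrm{ such that } \left\{ f(\xvec) : f\in \mathcal{F}\right\} = \{0,1\}^N \right\}. \]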

Note that the VC dimension doesn’t mean you can shatter all sets of size \(VC(\mathcal{F})\). In fact, this is impossible in general, since (for \(N \ge 2\)) a set consisting of \(N\) copies of the same value cannot be shattered, no matter how complex \(\mathcal{F}\) is. The VC dimension requires only that you can find some set \(\xvec\) of size \(VC(\mathcal{F})\) that is shattered, but no such set of size \(VC(\mathcal{F}) + 1\).

We have already shown that the VC dimension of the threshold classifier is \(1\).

The passage from VC dimension to ULLNs is somewhat involved, and beyond the scope of this (undergraduate) course. An end–to–end treatment can be found in Chapter 4 of Wainwright (2019); here we simply state the result.

Let \(VC(\mathcal{F}) = \nu\), and \(f\in \mathcal{F}\) take values in \(\{0,1\}\). Then, by Wainwright (2019):

\[ \sup_{f\in \mathcal{F}} \left| \frac{1}{N} \sum_{n=1}^N f(z_n) - \mathbb{E}\left[f(z)\right] \right| \le \sqrt{\frac{\nu \log (N+1)}{N}} + \delta \quad\textrm{with probability at least}\quad 1 - \exp\left(-N \delta^2 / 2 \right). \]

Specifically, this is a consequence of the following results from Wainwright (2019):

  • Proposition 4.18 and Definition 4.13 (VC dimension implies polynomial discrimination)
  • Lemma 4.14 and equation 4.24 (polynomial discrimination controls Rademacher complexity)
  • Theorem 4.10 (Rademacher complexity controls the ULLN)

Alternatively, see Section 6.5.2 of Shalev-Shwartz and Ben-David (2014).

Note that this result is stronger than our asymptotic results, in the sense that it is a finite–sample bound rather than an (otherwise uncontrolled) guarantee as \(N \rightarrow \infty\). Finite–sample bounds are much easier to obtain when the function \(f(x)\) is bounded as a function of \(x\), as is the case for zero–one loss but not typically for regression losses.
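As a rough numerical illustration (my own sketch, not taken from Wainwright (2019)), consider the threshold class, for which \(\nu = 1\) and the supremum over \(f\in \mathcal{F}\) is just the largest gap between the empirical and true proportions of points above a threshold. The snippet below draws uniform data, approximates that supremum on a grid, and compares it to the stated bound with \(\delta\) chosen so that the guarantee holds with probability at least \(0.95\).

```python
import numpy as np

rng = np.random.default_rng(0)
N, nu = 1000, 1                              # sample size; VC dimension of the threshold class
delta = np.sqrt(2 * np.log(1 / 0.05) / N)    # makes exp(-N * delta**2 / 2) = 0.05

bound = np.sqrt(nu * np.log(N + 1) / N) + delta
print(f"stated bound (holds with probability >= 0.95): {bound:.3f}")

# Approximate the supremum over f(z) = I(z >= theta) by sweeping theta over a grid.
z = rng.uniform(size=N)
thetas = np.linspace(0, 1, 2001)
empirical = (z[None, :] >= thetas[:, None]).mean(axis=1)  # (1/N) sum_n I(z_n >= theta)
truth = 1.0 - thetas                                      # E[I(z >= theta)] for Uniform(0, 1)
print(f"observed sup deviation: {np.max(np.abs(empirical - truth)):.3f}")
```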

From classifiers to loss

You will show in your homework that if a classifier has finite VC dimension then its zero–one loss function also has finite VC dimension, and vice versa.

References

Hastie, T., R. Tibshirani, and J. Friedman. 2009. The Elements of Statistical Learning: Data Mining, Inference and Prediction. 2nd ed. Springer.
Shalev-Shwartz, S., and S. Ben-David. 2014. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press.
Van der Vaart, A. 2000. Asymptotic Statistics. Vol. 3. Cambridge University Press.
Wainwright, M. 2019. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Vol. 48. Cambridge University Press.